Data

Grey Matter Data is a platform service for the versioned and encrypted storage of media blobs and assets.

Note that we mark sensitive values with "❗" so that it is clear what must be kept private, versus what is safely made public.

API Basics

Grey Matter Data uses JSON Web Tokens (JWT) for authentication. Each request to Grey Matter Data must include a cookie, set in the request headers, that carries the authentication JWT. Grey Matter Data tracks information through Event Objects. These Event Objects capture all changes and mirror the Kafka event queue that backs the system. Each Event Object is associated with a file through its Object ID (oid) parameter, and the parameters of the Event Object form the relationships between files in the system.

JWT Service

A JWT service, such as jwt-server, assumes that a proxy has already authenticated you and inserted the USER_DN header. The JWT service takes a redirect argument and a path argument. The path is the set of URLs over which the cookie will be sent; the redirect is a URL within that path. The cookie is written out with the name userpolicy and with HttpOnly set to true, preventing client scripts from accessing it.

The JWT token includes claims with the following format:

  • Label: the name that the token will be logged under.

  • Values: a hashtable from string to lists of strings, used to evaluate the JWT token against an objectPolicy.

Here is an example of a JWT token representing a userPolicy.

{
  "label": "asRob",
  "values": {
    "age": [ "driving" ],
    "citizenship": [ "USA" ],
    "email": [ "rob.johnson@email.com" ],
    "name": [ "rob johnson" ],
    "org": [ "www.deciphernow.com" ]
  }
}

Event

Grey Matter Data tracks all changes through JSON Event Objects. Events represent a portion (limited by the user's security access) of the Kafka messaging queue that backs Grey Matter Data. Any modification to the system is carried out through the /write endpoint by supplying one or more Events that describe the required actions.

Event parameters define the relationships between files in Grey Matter Data. For example, the parentoid parameter defines folder-to-child relationships, and an update with a new parentoid effectively moves an Object from one folder to another. The derived parameter points to Object IDs related to the current oid; for example, thumbnails derived from an image can point back to that image through the derived Event parameter.
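As a hedged illustration of a derived relationship (field names follow the Full Event Structure below; the oid, tstamp, and type values here are hypothetical), a thumbnail write-back might look like this:

const thumbnailEvent = {
  action: "C",
  isfile: true,
  parentoid: 42,                 // hypothetical folder of the original image
  name: "photo-thumb.png",
  mimetype: "image/png",
  derived: {
    oid: 900,                    // oid of the original image
    tstamp: 1577880000000000000, // tstamp of the original version; (oid, tstamp) is its primary key
    type: "jpg-to-thumbnail"     // what kind of derivation this is
  },
  originalobjectpolicy: "(yield R X)"
};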

For conceptual insight into why gm-data is designed the way that it is, see: Event Sourcing

Event Examples

Full Event Structure

[
  {
    "action": "string", // Actions on a file to put it into its current state.
    "blobalgorithm": "string", // The blob algorithm; if not specified, we store in S3 with SSECustomerKey.
    "checkedtstamp": "string", // The tstamp of the previous version of the oid we compared against.
    "custom": "string", // Put custom fields in here that gm-data does not understand, but that need to be tracked by the application.
    "defaultfile": "string", // For a directory, if we try to stream the directory, ${defaultfile} will be the assumed name. index.html is the default value.
    "derived": "string", // Fields to denote that a file is derived from another, ie: doc to pdf. (oid, tstamp) are a primary key for the original. Type allows us to track what kind of derivation it is, such as doc-to-txt.
    "description": "string", // Description of what is in the file or directory.
    "encrypted": "string", // Allows fields to be encrypted on a case-by-case basis.
    "expiration": "string", // The tstamp as of which this record will not come back from queries. May legally require purges at some point.
    "isfile": "string", // Set this if it is a file. You will get a directory if you don't set this.
    "mimetype": "string", // Same as Content-Type. Set for this object, and also for S3 blobs.
    "name": "string", // File/dir name without pathing.
    "objectpolicy": "string", // The rules that allow action (C R U D X P).
    "oid": "string", // A numeric identifier assigned to a file/dir. Happens to be a nanoseconds timestamp. (oid, tstamp) are the primary key for these events.
    "parentoid": "string", // oid of the parent directory.
    "purgetstamp": "string", // In order to purge just a single event, supply a purgetstamp along with the oid.
    "references": "string", // When uploading files, an array of updates can be supplied, so that we can upload files into directories that don't yet exist. Indices are negative relative values.
    "rname": "string", // ${AWS_S3_BUCKET}/${AWS_S3_PARTITION}/${rname} is where this object is in S3. The name is assigned even if only local storage is used.
    "schema": "string", // Version of the schema from which this object came.
    "security": "string", // Security labels are written here, along with their foreground/background pen colors.
    "sha256plain": "string", // sha256 of the plaintext; can be used in the client to calculate a minimum number of files to send for an update.
    "size": "string", // Same as Content-Length.
    "tstamp": "string", // Nanosecond timestamp of this object's creation time; unique per event.
    "userpolicy": "string" // JWT claims used when creating this event.
  }
]

Upload Event Examples

Prerequisites: Policy

Before you can upload anything, you need a policy to apply to what you upload. This is much like the requirement in AWS to supply an IAM policy that specifies how a capability is locked down. It is mandatory to attach a policy to any modification to the system.

Either applications or UIs may write these policies for you (ie: a CAPCO UI editor if you work for the government, or simple expressions written directly without the assistance of a UI). A policy is a function that takes your JWT claims as input and may emit permissions as output. Every step is a function that takes args.

You can think of this as a function in another language:

# The policy language is NOT Python, but this is an imperfect analogy.
# The function may have been defined as a built-in, or as a user-added
# macro.
def f(x, y, z):
    return (R, X)

# And this is calling a function
f(x, y, z)

But for many reasons, we have a very tiny evaluation engine that executes precompiled statements. The syntax is LISP (not Python, or JavaScript, or whatever) because LISP makes it possible to manipulate the code as a straightforward data structure, which is impractical in other syntaxes.

This is a simple function f with three argument values x, y, and z. Function evaluation is triggered by putting parentheses around the function and its args:

(f x y z)

This is a value that is not executed (no parentheses):

x

A function g with no args would be executed like:

(g)

It isn't obvious, but this is a very common syntax for embedding dynamic code into larger programs, because its completely uniform syntax gives simple mathematical rules for manipulating the code as data. These evaluators are so easy to write that they may show up as stored functions in databases (like MySQL) to filter for access. The runtimes are usually under a hundred lines, because there are no operator precedences or corner cases. You don't need to know what function (g) is ahead of time, other than looking it up in a symbol table to plug its args into it. A competitor language under standardization is OPA, which is far more complex than this, possibly too complex to expose to users; but OPA may end up in use here in future generations, assisted by UIs. If you use a CAPCO UI, then you are spared from knowing much about this. But we must have policies attached to files.

The most important function is the if function:

(if BOOLEANCONDITION
TRUEBRANCH
FALSEBRANCH)

When the first argument to if is true, the whole expression actually reduces to:

TRUEBRANCH

And when it's false, it reduces to

FALSEBRANCH

Many statements only have a TRUEBRANCH and just reduce to empty when BOOLEANCONDITION is false. As an example:

(if FALSE
TRUEBRANCH)

Just reduces to an implicit FALSEBRANCH, in which nothing is done. The next most important statement is contains (which is equivalent to has some). Contains implicitly reads our JWT claims input. If our JWT claims look like:

{
  "values": {
    "email": [ "rob.fielding@gmail.com" ]
  }
}

The only constraint is that the JWT has a values field that is a map from string to an array of strings. That really is the only requirement that we impose. JWT servers ensure that their JWTs have such a values field that can be operated on by policy.

So, if in our session, our JWT claims are as above; and the object is protected with the policy below:

(if (contains email rob.fielding@gmail.com)
(allow-all)
(allow-read)
)

The contains function looks in jwt["values"]["email"] to return true if rob.fielding@gmail.com is in the set. And it is in there, so the policy reduces to the function call:

(allow-all)

What is that? It's a macro for (yield C R U D X P). So, we have been given permission to Create, Read, Update, Delete, Execute, and Purge this object. Had our email been "rob.fielding@greymatter.io", it would reduce to:

(allow-read)

Which happens to be a macro for (yield R X), which means that we can Read the metadata about a file, or "execute" the file, by downloading the file bytes.

It also happens that the objectpolicy field is this LISP compiled down to JSON. This is great for Mongo, because Mongo stores JSON in a binary encoding, avoiding pointless parsing. This is a YAML rendition of it:

f: if
a:
- f: contains
  a:
  - v: email
  - v: rob.fielding@gmail.com
- f: allow-all
- f: allow-read

f is a function. a is an arg list. v is a value.
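For comparison, the same policy in the raw JSON form that lands in the objectpolicy field looks like this (just applying the f/a/v mapping above):

{
  "f": "if",
  "a": [
    {
      "f": "contains",
      "a": [ { "v": "email" }, { "v": "rob.fielding@gmail.com" } ]
    },
    { "f": "allow-all" },
    { "f": "allow-read" }
  ]
}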

Given this small set of functions, you can evaluate any JWT token against any policy attached to a candidate object.

  • (if BOOLEAN TRUEBRANCH FALSEBRANCH) ... boolean conditions

  • (contains FIELDNAME VAL1 VAL2 ...) ... true when FIELDNAME contains one of the listed values

  • (and ARG0 ARG1 ARG2 ....) ... and for as many args as you want

  • (or ARG0 ARG1 ARG2 ....) ... or for as many args as you want

  • (not ARG0) ... the not function.

  • (yield ARG0 ARG1 ARG2 ...) ... when permissions are returned.

This approach completely separates any customer-specific authorization system from gm-data; gm-data only knows what these primitive functions mean. That keeps things backwards compatible, because when you need new behavior, you ask for a new FUNCTION rather than changing the meaning of existing functions.
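To make the evaluation model concrete, here is a minimal evaluator sketch in JavaScript. This is not gm-data's actual engine (which is written in Go); the function names and error handling are assumptions for illustration only:

// Minimal sketch: evaluate an {f, a, v} policy node against a JWT values map,
// accumulating granted permissions as a side effect of yield.
function evaluate(node, values, permissions) {
  if ("v" in node) return node.v; // a bare value is not executed
  const args = node.a || [];
  switch (node.f) {
    case "if": {
      const branch = evaluate(args[0], values, permissions) ? args[1] : args[2];
      return branch ? evaluate(branch, values, permissions) : false;
    }
    case "contains": {
      const set = values[args[0].v] || [];
      return args.slice(1).some(a => set.includes(a.v));
    }
    case "and": return args.every(a => evaluate(a, values, permissions));
    case "or": return args.some(a => evaluate(a, values, permissions));
    case "not": return !evaluate(args[0], values, permissions);
    case "yield":
      args.forEach(a => permissions.add(a.v));
      return true;
    case "allow-all": // macro for (yield C R U D X P)
      return evaluate({ f: "yield", a: "CRUDXP".split("").map(v => ({ v })) }, values, permissions);
    case "allow-read": // macro for (yield R X)
      return evaluate({ f: "yield", a: [{ v: "R" }, { v: "X" }] }, values, permissions);
    default: throw new Error(`unknown function: ${node.f}`);
  }
}

// The policy and claims from the example above grant full permissions:
const perms = new Set();
evaluate(
  { f: "if", a: [
    { f: "contains", a: [{ v: "email" }, { v: "rob.fielding@gmail.com" }] },
    { f: "allow-all" },
    { f: "allow-read" }
  ]},
  { email: ["rob.fielding@gmail.com"] },
  perms
);
console.log([...perms]); // ["C", "R", "U", "D", "X", "P"]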

Almost every policy takes this default form: a file that is owned by its uploader and readable by anyone. The /static/ui will apply it by default when you upload files and make no modifications to the policy before the upload goes up.

This is the most important example. The embedded /static/ui uses the information in the GET /config call to formulate policies like this:

(if (contains userDN "cn=daveg,dc=greymatter,dc=io")
(allow-all)
(allow-read)
)

"If it's my userDN, then I have full control. Everyone else has read access."

If the JWT says values["userDN"][i] == "cn=daveg,dc=greymatter,dc=io" for any index i, then the boolean condition is true and (allow-all) is invoked. If it does not match, then it invokes (allow-read).

You will often see (yield R X) for (allow-read) or (yield-all) instead of (allow-all). Do a GET /macros/1/ against gm-data to see any macros added into the system.

Gotchas:

  • The booleans and, or, and not only make sense in the first argument to if, where the point is to return a true or false answer; there we check strings in the JWT with the contains function (and its more general has function, as in has some, an or across args, or has every, an and across all args). Each branch of an if should start with an if or a yield (which might be hidden in macros like allow-read or allow-all).

  • This is not a general language. There are no variables or local function definitions. Those might arrive if we migrate to OPA, but the main use of this language is: given a JWT's digitally signed claims, iterate over many thousands of objects to instantly determine what a user is allowed to do with each of them. We can filter directory listings at many thousands of entries per second with this language. Doing it this way also prevents organization-specific authentication code from creeping into gm-data.

Note: when action: "C" (create / upload) is specified, the system will backfill the Object ID internally when the Object is created, so you should not specify the oid parameter. The parameters below are the bare minimum that must be specified for the create action to complete internally.

const events = [
  {
    action: "C",
    name: "New Folder",
    isfile: false,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];
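A hedged sketch of dispatching these Events: the /write endpoint expects form data with the Event array under the meta key (gmDataEndpoint is a placeholder, as in the axios examples later in this document):

const form = new FormData();
form.append("meta", JSON.stringify(events)); // the Event array defined above
axios.post(`${gmDataEndpoint}/write`, form, { withCredentials: true })
  .then(resp => console.log(resp.data))
  .catch(error => console.log(error));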

Delete Event Example

Note: when action: "D" (delete) is specified, the system will backfill most of the Object parameters, so it is only necessary to specify action, oid, parentoid, and originalobjectpolicy.

const events = [
  {
    action: "D",
    oid: 42,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))"
  }
];

Rename Event Example

Note: when action: "U" (update) is specified, all parameters except tstamp and the few being updated must be specified, mimicking the previous Event associated with the Object ID that is being updated.

const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 1,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];

Move Event Example

Note: when action: "U" (update) is specified, all parameters except tstamp and the few being updated must be specified, mimicking the previous Event associated with the Object ID that is being updated.

const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 2,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];

This section introduced the Event Objects that Grey Matter Data uses to track information. Understanding these objects will help you perform the following actions: uploading (action: "C"), moving, renaming, or otherwise altering (action: "U"), and removing (action: "D").

Object ID

Grey Matter Data tracks stored data through unique Object IDs that are assigned on upload of files into the system. Relationships between Object IDs are established through the parentoid parameter of the Event. Creating an update Event with a new parentoid effectively moves an Object to a new folder. Learn more in the /write endpoint section.

When an Event with action: "C" (create) is sent into the system on upload through the /write endpoint, the oid parameter does not need to be specified; the system will assign it to the Event internally.

API Overview

This section covers accessing and manipulating data within Grey Matter Data.

We begin with the overarching concepts of information retrieval and information modification. Then we dive deeper into specifics of each API endpoint and code examples.

Information Retrieval

When starting to use the API, you will most likely direct your first request at the root folder to get initial file listings. You can accomplish this with a GET request to the /list endpoint with path of /1 (GET /list/1/).

The root directory has the well-known Object ID (oid) of 1 by default. This is the root folder for every user. However, due to the specific permissions prescribed by the authentication JWT, each user will only be able to see and manipulate a subset of folders.

You can extract data from the system in the following three ways, leveraging numerous read endpoints:

  1. As a JSON Object mimicking the internal Event Objects, through one of the read endpoints,

  2. As a raw byte stream through the /stream endpoint, used to download the Object locally, and

  3. As a raw byte stream within an iFrame that displays the security metadata of the Object, through the /show endpoint, used to view the Object within the browser window.

More information regarding each of those methods can be found in the respective endpoint sections (/read, /stream, /show).
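For example, a hedged sketch of the second method, downloading raw bytes through /stream with axios (the oid and gmDataEndpoint values are placeholders):

axios.get(`${gmDataEndpoint}/stream/900/`, {
  withCredentials: true,
  responseType: "arraybuffer" // keep the raw bytes instead of parsing JSON
})
  .then(resp => console.log(`downloaded ${resp.data.byteLength} bytes`))
  .catch(error => console.log(error));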

Information Modification

The only way to modify the content of Grey Matter Data is through the /write endpoint. When a request is sent to the /write endpoint, the request body has to carry form data with an appended {'meta': [Event]} object.

More details regarding data modification can be found in the endpoints /write section.

Prerequisites

When authenticating to the API, there is a prioritized set of options. Our JWT is a format that allows for LDAP-like groups. Tokens are signed by a signer that we trust, and have a label field carrying the username or generic name to be logged in audits, plus a values field which is a map[string][]string, which is to say a set of multi-valued attributes, similar to LDAP groups. This is done so that we can write policies as boolean combinations of these attributes. In short, you need a userpolicy somehow as a prerequisite to making use of this API (see the sketch after this list). The order we look is:

  • http parameter setuserpolicy set to a JWT, which we turn into a setcookie and re-forward you back in with this parameter removed. This may be used in setups without a JWT server or an edge proxy.

  • http parameter userpolicy set to a JWT. This may be used in setups without a JWT server or an edge proxy.

  • cookie userpolicy set to a JWT. This may be used in setups without a JWT server or an edge proxy.

  • http header userpolicy set to a JWT, and is set by the edge server, usually using USER_DN header as input. This is used in conjunction with the JWT filter.

  • configurable header USER_DN, which we trust was securely set by the edge server(!!). This can be used to look up a JWT in the JWT server. This must be used with an edge proxy with inheaders enabled.

  • anonymous.
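As a sketch of the userpolicy http parameter option above, for setups without a JWT server or edge proxy (the token value is hypothetical):

const userpolicy = "eyJhbGciOiJFUzUxMiIsInR5cCI6IkpXVCJ9...."; // a signed JWT from your issuer
axios.get(`${gmDataEndpoint}/list/1/`, {
  params: { userpolicy }, // sent as an http parameter
  withCredentials: true
});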

API File System Examples

This section covers multiple examples of http request configurations and explains the results they return.

Common GET Requests

All requests in this section can be accomplished by modifying the JavaScript code presented below.

const requestURL = `${gmDataEndpoint}/list/1/`;
axios.get(requestURL, {
  // necessary to pass on server-set HttpOnly authentication cookies
  withCredentials: true
})
  .then(resp => console.log(resp.data))
  .catch(error => console.log(error));

All requests in this table are GET requests with no request body, sent with credentials included.

  • GET /list/1/ - Get a listing of the root Object ID (oid) of 1, choosing a path / relative to it. The / symbol at the end of a listing path URL is mandatory. Each folder within the /1 root folder has its own unique security policy, limiting access to groups of users; each user navigating to the /1 folder sees a unique folder landscape tailored by their security credentials.

  • GET /list/1/Project1Folder - Returns listings for Project1Folder, a folder that is a child of the root folder. This folder may have unique security settings rendering it invisible to groups of users.

  • GET /list/42/ - If the Project1Folder dir had an Object ID (oid) of 42, then this would be an equivalent URL to list it. Note how we include the / symbol at the end of the path.

  • GET /props/42/ - Produces the metadata about the Project1Folder directory. Once we have found an Object that we are looking for, we can perform operations on it.

  • GET /stream/900/ - Produces a bytestream of the Object with Object ID (oid) of 900. Presume this Object's name property is resume.pdf.

  • GET /stream/42/resume.pdf - Streams the file named resume.pdf inside the folder with oid 42; a path-based equivalent of /stream/900/.

  • GET /props/900/ - The metadata of the Object named resume.pdf. Returns an Event Object with associated properties.

  • GET /history/900/ - A list of Event Objects for every state of resume.pdf, ordered by the time stamp of the Event.

  • GET /show/900/ - A convenience wrapper around /stream that shows an html security banner with the file's security metadata around the byte stream.

Common POST Requests

The GET requests above can be dispatched separately or in bulk using a POST request to the /read endpoint. This lets you minimize back-and-forth HTTP traffic to improve performance in low-bandwidth situations.

All requests in this table are POST requests to the /read endpoint, sent with credentials included.

  • Body: stringified([{URL: "/list/900/"}, {URL: "/list/42/"}]) - The endpoint requires a string-encoded array of this form in the body of the request. A detailed example can be found in the /read endpoint section. The call yields an array with data identical to the same calls performed individually as GET requests; in this specific example, we list two directories simultaneously, which allows for quick file system exploration with significantly fewer requests.

  • Body: stringified([{URL: "/history/900/?count=10"}, {URL: "/history/42/?count=10"}]) - Simultaneously gets the last 10 revisions of 2 separate Object IDs.

  • Body: stringified([{URL: "/derived/900/"}, {URL: "/derived/42/"}]) - Simultaneously gets derived-file metadata for 2 separate Object IDs.
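A hedged sketch of the first /read call above with axios; the body is the stringified array of URL objects shown in the table:

axios.post(
  `${gmDataEndpoint}/read`,
  JSON.stringify([{ URL: "/list/900/" }, { URL: "/list/42/" }]),
  { withCredentials: true }
)
  .then(resp => console.log(resp.data)) // one result per requested URL
  .catch(error => console.log(error));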

To get data into the system, a request with attached multipart/form-data needs to be performed against the /write endpoint. The transaction is an array of individual JSON Event Objects, in the order in which they need to be applied in the database (optionally including file objects in BLOB format appended to the form data when performing an upload). Detailed examples can be found in the /write endpoint section.

All requests in this table are POST requests to the /write endpoint, sent with credentials included.

  • Body: form data {'meta': [Event1Object]} - The endpoint requires form data with an appended array of Event Objects under the 'meta' property, specifying a modification to the system. A detailed example can be found in the /write endpoint section.

  • Body: form data {'meta': [Event1Object, Event2Object]} - The endpoint can accept multiple Event Objects at the same time.

  • Body: form data {'meta': [Event1Object, Event2Object]} plus appended BLOB1, BLOB2 - Event Objects can be accompanied by file blobs appended to the form data when performing uploads.
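A hedged upload sketch combining meta and blob parts with browser FormData (the events array and file Blob are hypothetical placeholders):

const form = new FormData();
form.append("meta", JSON.stringify(events)); // Events, in the order they apply
form.append("blob", file);                   // file bytes for the upload Event
axios.post(`${gmDataEndpoint}/write`, form, { withCredentials: true })
  .then(resp => console.log(resp.data))
  .catch(error => console.log(error));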

Common Request Error Response Codes

Common causes by HTTP error code:

  • 400 Bad Request - most often caused when the Event Object in /write form data is malformed.

  • 403 Forbidden - most often caused when the JWT authentication token doesn't match the Object's privileges.

  • 404 Not Found - most often caused when the Object ID (oid) specified in the request is incorrect.

Command Line Interface and Go Client Package

There is a command-line interface to support bulk and automated scenarios. This should help ease the implementation burden for some very common tasks.

The CLI commands all need to connect in an authenticated manner, so there are environment variables associated with connecting. Here is an example of connecting to a PKI-enabled setup; the environment variables only need to be set once in a wrapper script, after which the tool can be invoked directly:

#!/bin/bash
# Name this script:
# gmdatatool.sh
## Environment setup - depends on how gm-data TLS and address is configured
(
  u=`uname`
  if [ "${u}" == "Darwin" ]
  then
    b64="base64"
  else
    b64="base64 -w 0"
  fi
  export MONGO_USE_TLS=false
  export CLIENT_PORT=9443
  export CLIENT_CN=localhost
  export CLIENT_ADDRESS=localhost
  export CLIENT_PREFIX=/services/gmdatax/latest
  export CLIENT_USE_TLS=true
  # wherever your certs are
  export CLIENT_CERT=`cat ../../certs/localhost.crt | ${b64}`
  export CLIENT_KEY=`cat ../../certs/localhost.key | ${b64}`
  export CLIENT_TRUST=`cat ../../certs/intermediate.crt | ${b64}`
  ./gmdatatool.linux "$@"
)

# Create a directory that we control under self-service directory /world
./gmdatatool.sh mkdir --securitylabel "localuser owned" \
  --securityfg "white" \
  --securitybg "red" \
  --policylabel "localuser owned" \
  --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
  /world/localuser@deciphernow.com

# Upload an entire application into /world/localuser@deciphernow.com
./gmdatatool.sh upload --securitylabel "SECRET" \
  --securityfg "white" \
  --securitybg "red" \
  --policylabel "localuser owned" \
  --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
  /world/localuser@deciphernow.com ../../static/ui

The tool ./gmdatatool.sh is a special-case use of the Go package github.com/deciphernow/gm-data/client. The client is based around two important ideas:

  • Listening for changes in gm-data, and invoking callbacks when they happen.

  • Providing an API to respond to changes. Example uses:

    • Statically generated thumbnails

    • Run AWS Rekognition to upload derived files on images, such as object-labelling.

    • The written-back files are JSON, and they point to the image from which they are derived

  • Responding to changes may happen through REST or Kafka.

There is a responder, with REST or Kafka constructors. The REST constructor filters out information based on objectPolicy (ie: it runs as a real user). The Kafka constructor runs on a privileged, unfiltered view of all events that happen on gm-data. Generally, the Kafka view is appropriate for back-end processes. The REST constructor is usable from front-end (ie: not originating from within Fabric itself, possibly even from web browsers calling the /notifications endpoint), or back-end.

// Create a client at the root
c, err := client.NewRESTResponder(
    logger,
    client.GetURL(),
    getClient(),
    listing.DefaultRootOID,
    policy.CurrentTstamp(),
    1000,
    time.Duration(2)*time.Second,
    client.CLIENT_IDENTITY.Str(),
    func(c *client.Responder, ev *listing.Event) error {
        return nil
    },
)
if err != nil {
    log.Printf("create client failed: %v", err)
    panic(err)
}

This responder will poll every two seconds for new information, and get up to 1000 events at a time. The callback allows us to inspect events with our code. Generally, when we see something interesting in the event (ev), we call different parts of the API:

// Get an io.Reader on ev, as it is a file type that we are interested in
blobData, err := c.StreamOf(ev.Oid, ev.Tstamp)

We may then go do something outside the scope of gm-data, such as turn a blob into a json file (ie: submit a jpg and get back a json description of it). Note that when we are doing listen and write-back like this, we typically end up setting Derived fields, so that we can track the lineage of why the file exists, and what created it. We can correlate a jpg of a face with a json about it, so that we can delete them both if we are asked to delete the file.

m := c.NewWriteMarshaler()
defer m.Close()
err = m.Append(&listing.EventArgs{
    Action:       policy.ActionUpdate,
    IsFile:       true,
    ParentOID:    ev.ParentOID,
    Name:         newFname,
    MimeType:     "application/json",
    ObjectPolicy: policy.ForReadAllFull,
    Derived: listing.Derived{
        Oid:    ev.Oid,
        Tstamp: ev.Tstamp,
        Type:   kind,
    },
    Security:      ev.Security,
    BlobAlgorithm: "none",
}, newFname)
...
req, err := c.NewWriteRequest(m)
...
res, evs, err := c.DoWriteRequest(req)
...

The client API supports the functions required to respond to changes in gm-data with write-backs of new derived files. For things related to read endpoints:

  • NewRESTResponder/NewKafkaResponder - Listen on /notifications, which is the critical reason for having a client library, to respond to changes being made in gm-data.

  • StreamOf - Get the bytes for an (oid,tstamp), where tstamp is optional, so that you get the latest blob.

  • EventOf - Get the properties for an (oid,tstamp), or latest if tstamp is not included.

  • DerivedOf - Find out what is already derived from this file. This is how you could know that a thumbnail already exists for a file.

  • Self - Discover what we are authenticated as, which is important for troubleshooting.

  • HistoryOf - Every event pertaining to an oid. This is the lifecycle of the inode, across all changes (including name, parent, policy, security labels, etc).

Note that more complex paging options are not being used with these simple client libraries.

For things related to the write endpoint, which are a bit more difficult to write directly against the API for yourself than the read endpoints:

  • AppendTree - Perform a bulk upload of a large directory, with the opportunity to set security labels and policies individually.

  • Append - A raw append to update an individual file or directory.

Example use case:

  • GDPR laws require that an individual can demand the removal of files "about" that individual.

  • In order to comply, if we have a jpg with attached metadata that says that the individual is named in the file, then we can issue a delete on both files.

  • This is possible because we track the Derived file pointers.

  • The /derived endpoint lets us find all files that point to us with a Derived pointer, so that we can find an entire tree of files that started from a single input file. Example: elasticSearchEntry derivedfrom facesIndex, facesIndex derivedfrom jpg

Environment And Deployment

The gm-data service ships a binary called gmdatax.linux that is configured entirely by environment variables (to avoid a requirement to mount files). The binary is, however, packaged with some other files.

  • ./runforever - a shell script that keeps ./gmdatax.linux in a re-start loop to handle non-intentional crashes of the binary. This allows us to catch things like array out of bounds, nil pointer dereference, or catastrophic resource exhaustion such as out of file handles. It is these latter cases that drive the decision to allow the binary to die.

  • ./gmdatax.linux - the actual gmdata binary, that reads in environment variables.

  • ./VERSION - the version of this service

  • ./static/ - a bundle of runtime API user documentation and a test user interface. This directory is served literally out of gm-data under the URL /static/

  • ./certs/ - a directory that the binary can write certificates into on startup. The certificates originate from environment variables passed in as single-line base64 encodings of full pem files.

  • ./logs/ - a place to write logs (in non-default cases), and may be mounted over to keep the root partition from running out of space.

gm-data will make every possible attempt to inspect your configuration and immediately crash with a detailed explanation of what to actually do about it. This includes looking up hostnames in DNS to verify that they exist. Always look in the gm-data log files if something does not seem right on startup. It cannot, however, detect inconsistency issues at a higher level, such as whether the cert one service offers is actually trusted by the service that tries to connect to it; that would require analyzing a larger set of environment variables destined for multiple services.

Basic Environment Variables

  • MASTERKEY❗ is mandatory. This is the key that is used to encrypt data.

  • JWT_PUB is the single-line base64 encode of the public key that the gm-data server trusts for verifying JWT token signatures. This is a mandatory parameter. It is not an X509 certificate; it is an actual Elliptic Curve key suitable for ES512 in the JWT standard.

  • FILE_BUCKET is mandatory (aka: AWS_S3_BUCKET). This says where we write gm-data ciphertext out to AWS.

  • FILE_PARTITION is mandatory (aka: AWS_S3_PARTITION). This should be set to a value that is unique to a set of replicated Fabric clusters. It is literally a subdirectory in FILE_BUCKET. This exists so that we don't need to create lots of buckets constantly, yet can still distinguish which bucket data belongs to which installation.

  • AWS_REGION is required if USES3=true.

  • AWS_S3_ENDPOINT is only required in government setups that need to point to a different hostname for S3.

  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY❗ may be set to give AWS credentials in the case where IAM roles are not used for the EC2 instance. AWS_SECRET_ACCESS_KEY❗ is a secret, obviously.

When you disable S3 use like USES3=false, the bucket and partition are still used. The directory ./buckets/${FILE_BUCKET}/${FILE_PARTITION} should exist and be writable by the gm-data process ./gmdatax.linux.

The JWT_PUB is the public part of an elliptic curve key; the private part is PRIVATE_KEY❗ for the JWT server. The parameters for use with the JWT libraries are rather specific, due to the curve name secp521r1. This is how we generate our keypairs, which is done specifically for gm-jwt-security: first a private key for signing (file jwtES512.key❗), and then the public key derived from it (jwtES512.key.pub), which is set for gm-data as JWT_PUB.

openssl ecparam -genkey -name secp521r1 -noout -out jwtES512.key
openssl ec -in jwtES512.key -pubout -out jwtES512.key.pub

References Between Services

  • Prefix patterns. When gm-data needs to make reference to another service, these are relevant environment variables:

    • CLIENT_PREFIX is the URL that the gateway is mapping gm-data service to. This is done so that we can send back links that resolve properly in html files. We do this because we cannot hardcode even our own service name, and also cannot correctly give a relative path. Example: /services/gmdatax/latest

    • CLIENT_JWT_PREFIX is the URL that the gateway is mapping our peer service gm-jwt-security to. This is done so that we can send back links that resolve properly in html files. Example: /services/jwt-server/1.0, or /services/jwt-server-gov/1.0.

We have explicit dependencies on these things:

  • a JWT token issuer, that has a proper sidecar, and is reachable through the edge

  • a Mongo database, which is not mounted into the Fabric framework, so it is not reached via a sidecar or through the edge.

  • Kafka, which is not mounted into the Fabric framework, so it is not reached via a sidecar or through the edge.

TLS

Services that use TLS will end up creating a large number of environment variables. We follow a principle of passing in pem files as a single line of base64 of the original pem file. That means that we create such files as environment variables on the host that is preparing the deployment. Here is an example of setting up the trust for our Mongo dependency:

MONGO_TRUST=`cat server.trust.pem | base64 -w 0`

When TLS connections to peer services are involved, this pattern in name suffixes arises:

  • ADDRESS (or HOST) - ip or hostname of the peer.

  • PORT - port for the peer.

  • USE_TLS - use TLS

  • CERT - a base64 single-line encode of the pem cert (which also happens to be multi-line base64).

  • KEY❗ - base64 single-line encode of the pem key (which is a multi-line base64). This is also a secret.

  • TRUST - this is similar to CERT. It may encode a concatenated list of pem files for certs.

  • CN - the ServerName expected. This is usually the same as the CN in the remote cert, but may also be an SNI name that matches a wildcard in the CN. If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.

With that being said, these variables are grouped together.

  • MONGO related connect info

    • MONGOHOST - slightly violates our pattern. This can be a list of host:port pairs, like mongodata:27017,mongodata:27017. This is because in a clustered setting, connections are not made to individual machines, but to entire clusters. The PORT part is already taken care of.

    • MONGODB - is not strictly part of TLS, but we need to know the database that we are connecting to.

    • MONGO_USE_TLS - says whether to use the TLS variables to make a TLS connection.

    • MONGO_CERT - is the client PKI cert that we identify ourselves with.

    • MONGO_KEY❗ - is the key that goes with MONGO_CERT.

    • MONGO_TRUST - is the trust file to connect to Mongo servers.

    • MONGO_CN - is SNI name for the mongo cert, the manually set serverName expected. If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.

    • MONGO_INITDB_ROOT_USERNAME - is the username we will use (not necessarily related to the root username however).

    • MONGO_INITDB_ROOT_PASSWORD - the password for MONGO_INITDB_ROOT_USERNAME. This is a secret of course.

  • GMDATA TLS info, for our own service. This generally only happens when the sidecar egress is mTLS.

    • GMDATA_USE_TLS - Says whether to use TLS. This will need to be coordinated with how our sidecar is set up. Our sidecar EGRESS will need to be a client of this TLS connection.

    • GMDATA_CERT - The identity cert of gmdata that will be presented to sidecar.

    • GMDATA_KEY❗ - The key that goes with GMDATA_CERT

    • GMDATA_TRUST - The sidecar will need to present a cert that is signed by something in this TRUST

  • CLIENT_JWT_ENDPOINT prefixed environment variables are relevant to gmdata looking up userpolicyid (a random key to find a JWT) to get a userpolicy (an actual JWT token). This is only needed in cases where we have a jwt server indirectly via userpolicyid.

    • CLIENT_JWT_ENDPOINT_ADDRESS - is the hostname of the JWT server

    • CLIENT_JWT_ENDPOINT_PORT - is the port of the JWT server

    • CLIENT_JWT_ENDPOINT_USE_TLS

    • CLIENT_JWT_ENDPOINT_CERT

    • CLIENT_JWT_ENDPOINT_KEY

    • CLIENT_JWT_ENDPOINT_CN - Expected SNI name

    • CLIENT_JWT_ENDPOINT_TRUST

    • CLIENT_JWT_ENDPOINT_PREFIX - if we connect directly or to the sidecar, then this is just left at empty string "". But if we go through the edge, which is an unlikely case, this ends up needing to be set to the same value as CLIENT_JWT_PREFIX.

    • JWT_API_KEY - is a base64 password that the JWT server will require to accept connections to resolve access codes for JWT tokens (userpolicyid) to actual JWT tokens (userpolicy).

Note that for the JWT server, we are trying to form a connection URL like:

# proto is either http or https depending on CLIENT_JWT_ENDPOINT_USE_TLS
# cert setup is the normal pattern:
# CLIENT_JWT_ENDPOINT_CERT
# CLIENT_JWT_ENDPOINT_KEY
# CLIENT_JWT_ENDPOINT_TRUST
GET ${proto}://${CLIENT_JWT_ENDPOINT_ADDRESS}:${CLIENT_JWT_ENDPOINT_PORT}${CLIENT_JWT_ENDPOINT_PREFIX}/policies

Internally, gm-data sees a userpolicyid header, and connects to that URL to try to get a userpolicy object, which may be too large to have fit into an http header. Notice that the inclusion of CLIENT_JWT_ENDPOINT_PREFIX exists only to go through the edge instead of the sidecar. In the normal case CLIENT_JWT_ENDPOINT_PREFIX="", because we want to talk to the sidecar.

Examples:

  • Talk to our own local sidecar in plaintext to reach JWT (preferred):

    • CLIENT_JWT_ENDPOINT_PREFIX=/services/jwt-server/latest

    • CLIENT_JWT_ENDPOINT_ADDRESS=gmdata-proxy

    • CLIENT_JWT_ENDPOINT_PORT=8080

  • Talk to a JWT sidecar directly (not preferred):

    • CLIENT_JWT_ENDPOINT_PREFIX=

    • CLIENT_JWT_ENDPOINT_ADDRESS=jwt-server-proxy

    • CLIENT_JWT_ENDPOINT_PORT=8080

CLIENT_JWT_ENDPOINT_USE_TLS may require connecting to a sidecar-issued cert, that may not exist at the time gm-data launches. So, note that using GMDATA_USE_TLS in the mesh may be complicated by this fact.

Miscellaneous parameters

  • DONT_PANIC - is an advanced parameter that says to only WARN, but do not CRASH when inconsistent environment variables are detected. If you run with this setting, you run the risk of creating a setup that we cannot support. Sometimes you need to temporarily ignore known problems. So, this should be disabled as soon as possible if it is ever used.

  • LESS_CHATTY_INFO - by default, we like less chatty logs. If you want a lot more logging information that includes the begin and end of sessions in which there were no problems, then you can set this to false.

  • GMDATAX_SESSION_MAX - is an admission control value. This imposes a limit on the number of outstanding requests gm-data will allow to be concurrently serviced. It is literally a maximum population at which gm-data just issues 503 to tell the client to get out of line and come back later. It exists because if we run out of filehandles, the server will become unstable and crash in an irregular manner. If this server runs out of filehandles, then GMDATAX_SESSION_MAX should be lowered to a value that stops us from running out of filehandles. It may need to be raised if we get 503 errors that actually originate from gm-data itself. Our proxy may also issue 503 for admission control, which complicates determining which one ran out. It is more likely that Envoy will run out of filehandles before gm-data will, because the front-end is dealing with many services concurrently.

  • GMDATA_NAMESPACE - Typical value is world. In order to avoid having to create root access tokens to get the system bootstrapped, we allow for the creation of a self-service directory. If this value is world, then home directories can be created under /world, on the condition that each directory is named after the field given by GMDATA_NAMESPACE_USERFIELD, which is typically email. For example: /world is created empty on init of gm-data. A user uses /static/ui to create the directory /world/rob.johnson@email.com, which is only allowed because he came in with a JWT token matching {values: {email: ["rob.johnson@email.com"]}}.

  • GMDATA_NAMESPACE_USERFIELD - Typical value is email.

If an environment variable you are looking for was not mentioned here, it's likely not something that you should need to change in a normal setup. For more detail, see the auto-generated documentation on environment variables in the Deploy - Environment Variables table below.

Kafka Connect

In order to point gm-data at Kafka, in the simplest plaintext case, set the env vars relating to Kafka. At a minimum, point to the brokers and name the topics.

KAFKA_PEERS=kafka:9092
KAFKA_TOPIC_ERROR=gmdatax-error
KAFKA_TOPIC_READ=gmdatax-read
KAFKA_TOPIC_UPDATE=gmdatax-update

Deploy - Environment Variables

Each entry below gives the variable name, its default value in parentheses when one exists, a description, and example and type information where available.

  • DISABLE_LOOKUPS (default: false) - Don't DNS-check env vars representing hosts. Example: true

  • DONT_PANIC (default: false) - Disable panic when the environment looks mis-configured. Example: true

  • LESS_CHATTY_INFO (default: true) - Chatty info logs will write something to the log when a transaction begins, even when there are no problems. Example: false

  • CLIENT_JWT_PREFIX (default: /services/gm-jwt-security/1.0) - Endpoint prefix for the primary jwt service to resolve pointers to JWT tokens. Example: /services/gm-jwt-security-gov/1.0

  • CLIENT_JWT_ENDPOINT_ADDRESS - IP of the jwt server in the network. Type: a hostname

  • CLIENT_JWT_ENDPOINT_PORT - Port of the jwt server in the network. Example: 8443. Type: an unsigned int

  • CLIENT_JWT_ENDPOINT_CERT - JWT server client cert. Type: base64 line pem, written to certs/jwt.cert.pem

  • CLIENT_JWT_ENDPOINT_KEY❗ - JWT server client key. Type: base64 line pem, written to certs/jwt.key.pem

  • CLIENT_JWT_ENDPOINT_TRUST - JWT server trust. Type: base64 line pem, written to certs/jwt.trust.pem

  • CLIENT_JWT_ENDPOINT_PREFIX - Prefix to reach the CLIENT_JWT_PREFIX when proxied. Example: localhost

  • CLIENT_JWT_ENDPOINT_USE_TLS (default: false) - Use TLS to connect to the jwt endpoint. Example: true

  • CLIENT_JWT_ENDPOINT_CN - The server name expected for this cert.

  • GMDATA_FABRIC_CLUSTER (default: default) - The name of this fabric cluster. Example: us-east

  • ZEROLOG_LEVEL (default: WARN) - Logging level: INFO, DEBUG, WARN, ERR. Example: INFO

  • MASTERKEY❗❗ - Master key for the encrypted content. Example: som3r9doMg1bberish

  • AWS_REGION - Bucket location. Example: us-east-1. Type: some non-whitespace token

  • AWS_S3_BUCKET - Bucket name, overridden by FILE_BUCKET. Type: a token without whitespace or special chars

  • AWS_S3_PARTITION - Subdirectory within the S3 bucket, overridden by FILE_PARTITION. Example: username

  • FILE_BUCKET - Bucket name. Type: a token without whitespace or special chars

  • FILE_PARTITION - Subdirectory within the file bucket. Example: username

  • AWS_S3_ENDPOINT - Bucket host override. Example: s3.region.aws.com. Type: a hostname

  • AWS_REKOGNITION_ENDPOINT - Rekognition host override. Example: rek.region.aws.com. Type: a hostname

  • AWS_ACCESS_KEY_ID - Set if not using IAM roles for the machine. Example: AKAI...

  • AWS_SECRET_ACCESS_KEY❗ - Set if not using IAM roles for the machine. Example: AEFE...

  • USES3 (default: true) - Use S3 for blob storage. Example: false

  • S3_TASKS (default: 512) - Max number of concurrent S3 tasks. Example: 64. Type: an unsigned int

  • KAFKA_PEERS - Kafka nodes to talk to directly. Example: localhost:9092. Type: a comma-delimited list of host:port

  • KAFKA_TOPIC_UPDATE - Kafka topic for update events. Example: gmdu. Type: some non-whitespace token

  • KAFKA_TOPIC_READ - Kafka topic for read events. Example: gmdr. Type: some non-whitespace token

  • KAFKA_TOPIC_ERROR - Kafka topic for errors. Example: gmde. Type: some non-whitespace token

  • KAFKA_CONSUMER_GROUP (default: test1) - Kafka consumer group id. Example: imageconverters. Type: some non-whitespace token

  • KAFKA_CERT - Kafka id cert. Type: single-line base64 of a pem

  • KAFKA_KEY❗ - Kafka id key. Type: single-line base64 of a pem

  • KAFKA_TRUST - Kafka id trust. Type: single-line base64 of a pem

  • KAFKA_USE_TLS (default: false) - Use TLS for kafka directly. Example: true

  • KAFKA_CN (default: false) - CN for kafka. Example: true

  • TEST_JWT_PRIV❗❗ - Test only! A base64 encoded single line of the private key for internal signing during tests.

  • JWT_PUB - The single-line base64 encode of the public key of jwt tokens we accept. Example: export JWT_PUB=`cat jwtRS256.key.pub | base64 -w 0`

  • JWT_PUB_1, JWT_PUB_2, JWT_PUB_3, JWT_PUB_4 - Additional single-line base64 encodes of public keys of jwt tokens we accept, in the same format as JWT_PUB.

  • JWT_NOT_BEFORE_SKEW_SECONDS (default: 86400) - Seconds that not-before is in the past, to handle mutual clock skews. Example: 60. Type: an unsigned int

  • MONGOHOST_MASTER - Mongo host ip:port that we replicate with. Example: m1:27017,m2:27017. Type: a comma-delimited list of host:port

  • MONGODB_MASTER - Mongo database we replicate with. Example: gmdatadev. Type: some non-whitespace token

  • MONGOHOST - Mongo host ip:port. Example: m1:27017,m2:27017. Type: a comma-delimited list of host:port

  • MONGODB (default: gmdatax) - Mongo database. Example: gmdatadev. Type: some non-whitespace token

  • MONGO_CERT - Mongo TLS cert. Example: `cat ./certs/server.cert.pem | base64 -w 0`. Type: single-line base64 of a pem

  • MONGO_KEY❗ - Mongo TLS cert key. Example: `cat ./certs/server.key.pem | base64 -w 0`. Type: single-line base64 of a pem

  • MONGO_TRUST - Mongo TLS trust. Example: `cat ./certs/server.trust.pem | base64 -w 0`. Type: single-line base64 of a pem

  • MONGO_CN - Mongo SNI name.

  • MONGO_SOURCE - Mongo login source. Example: $external

  • MONGO_MECHANISM - Mongo login mechanism. Example: MONGODB-X509

  • MONGO_USE_TLS (default: false) - Mongo use TLS. Example: true

  • MONGO_INITDB_ROOT_USERNAME - MongoDB user id. Example: mongoadmin

  • MONGO_INITDB_ROOT_PASSWORD❗ - MongoDB password. Example: S0m3Pass

  • TEST_LOAD_ITERATIONS - Number of iterations for the load test. Example: 10000. Type: an unsigned int

  • GMDATA_NAMESPACE - A directory in the root that lets you create content as yourself.

  • GMDATA_NAMESPACE_USERFIELD - The field that matches up with the directory you can create.

  • GMDATA_NAMESPACE_TEMPLATE (default: (if (contains %s "%s") (yield-all) (yield R X))) - The default template to create a user implicitly.

  • DELETE_EXPIRED (default: false) - Actually remove expired entries periodically to comply with privacy laws. Type: true or false

  • DELETE_EXPIRED_POLL_SECONDS (default: 600) - Number of seconds between polls for expired data. Example: 3600. Type: an unsigned int

  • NOTIFICATION_CACHE_SIZE (default: 1000) - Number of items to cache when watching notifications on an oid. Example: 100. Type: an unsigned int

  • MIMETYPES_OVERRIDE - Supply an alternate mime.types. Example: ./mime.types

  • LISTING_DEBUG (default: false) - Turn on debug for the listing package. Example: true

  • BIND_ADDRESS (default: 0.0.0.0) - Bind address for the port. Example: 127.0.0.1. Type: a hostname

  • BIND_PORT (default: 8181) - Bind port. Example: 9123. Type: an unsigned int

  • PRETTY_PRINT (default: true) - Pretty print returned json by default. Set this to false in production, as it makes the json larger. Example: false

  • HTTP_TRANSPORT_CANCEL_HOURS (default: 4) - Hours before an http call is cancelled. Example: 24. Type: an unsigned int

  • USE_PPROF_CPU (default: true) - CPU profiling in pprof. Example: false

  • USE_PPROF_MEM (default: true) - Mem profiling in pprof. Example: false

  • HTTP_CACHE_SECONDS (default: 10) - Default http cache in seconds. Example: 60. Type: an unsigned int

  • TRACE_LOG - Write a trace to the named file. Example: /logs/trace.out

  • REKOGNITION_FACE_INDEX - Set a face index for AWS Rekognition. Example: hackathon

  • LOG_OPEN_FILE_HANDLES (default: true) - Log open file handles to look for leaks. Example: false

  • GMDATAX_CATCH_PANIC (default: false) - Catch panics rather than restarting gmdatax. Example: true

  • GMDATAX_SESSION_MAX (default: 4096) - Max http sessions in progress. Example: 10000. Type: an unsigned int

  • JWT_API_KEY - JWT api key; a password. Type: single-line base64 encoded string

  • NAMED_BANNER (default: true) - Include the name in the banner. Example: false

  • GMDATA_CERT - gmdata id cert. Type: single-line base64 of a pem

  • GMDATA_KEY❗ - gmdata id key. Type: single-line base64 of a pem

  • GMDATA_TRUST - gmdata id trust. Type: single-line base64 of a pem

  • GMDATA_USE_TLS (default: false) - Use TLS for gmdata directly. Example: true

  • GMDATA_REQUIRE_CLIENT_CERT (default: true) - Demand a client cert. Example: false

  • GMDATA_AUTHENTICATION_HEADER (default: USER_DN) - A header that is TRUSTED to contain an authenticated user id. Disable with the value '-'. Example: -

  • POLICY_CACHE_LIFETIME (default: 60) - The amount of time an object lives in the objectpolicy cache. Example: 30. Type: an unsigned int

Authentication, Authorization, ObjectPolicy, and Permission

Policies

This service does not contact an authorization service (or an authentication service, for that matter), and it should stay that way. In order to have a secure service, encryption of data is not nearly enough; there needs to be a workable system for calculating access to objects. For the authentication part, we do not want a username/password for authentication purposes (though we might want a password for end-to-end encryption purposes):

  • Treat incoming users as a set of digitally signed attributes only. They are not usernames that we need to go look up elsewhere.

  • This allows us to operate without having to further defer to more backend microservices for authorization.

  • We have no backend user service that we can defer to at the moment.

  • There is a standard that does this, and it is possible to use a minimum subset of it so that it is easy to secure and implement.

    • JWT (json web token) is as simple a specification as it can be. There is an RFC mess over top of it that brings in hazards, and we do not implement that.

      • The basic spec contains a base64 header that hints on the signing algorithm (and we must CHECK that it is the only thing that we expect).

      • The basic spec allows expiration dates to be encoded.

      • The basic spec allows for arbitrary content in the payload.

      • All of the header and payload are json chunks, so that it is not hard to parse or document.

Example Of Generic Authentication

Some signing service, that is not us, generates an ECDSA signing key ❗:

-----BEGIN EC PRIVATE KEY-----
MIHcAgEBBEIB3hA+StvLndr3qCMhY8SOWu5MM/Oim2SqVA8GFWV+Lmnc03OuacyZ
...
uqTT+pE6m1KYIbuBrsv0TgIrYPWXMdPpTUaUGytBtw==
-----END EC PRIVATE KEY-----

From that ECDSA signing key, it can generate and publish an ECDSA public key:

-----BEGIN PUBLIC KEY-----
MIGbMBAGByqGSM49AgEGBSuBBAAjA4GGAAQBDZpYnSarEIirBqbqxzqpV+HyXkx0
...
K2D1lzHT6U1GlBsrQbc=
-----END PUBLIC KEY-----

If we trust this service to sign attributes, then we include this public key in our trust store of public keys. We currently only support one public key for this purpose. That server generates digitally signed statements, designed to be passed in http headers and used in OAuth tokens. The user logs in to an Authorization service (with an X509 certificate or a password), and can SUGGEST to the site what needs to be signed. That service should reject any assertions that the user makes about himself that cannot be verified as true (according to its internal database), and the server is allowed to inject new attributes such as expiration.

{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }

So, the server responded to the user request and gave it this token (❗).

eyJhbGciOiJFUzUxMiIsInR5cCI6IkpXVCJ9.eyJhZ2UiOlsiYWR1bHQiXSwiZW1haWwiOiJyb2IuZmllbGRpbmdAZ21haWwuY29tIiwib3JnIjpbImRlY2lwaGVyIl19.AVDnBeIOgAsTblY1YbI4K7JQ_28zNbeVCS3fpKXkQMtHRJhSHZza9dgHuQhGwLn4gm_CngmPNRwkzDJHjg6AFJ12AMAxX4u04Im4EQQWOKAasOLr2A-3-uDaq18hU_s8siSA-24tru3WsqG_47bAuNYKt6m7mGQk3pBi2upWgYRJzRC7

Our app ONLY sees these tokens on incoming requests, set in http headers (cookies, authentication). The first part of the token is actually not encrypted, but this bearer token should NOT be leaked to anybody but the service that needs it. It decodes in plaintext as this:

Header:

{ "alg": "ES512", "typ": "JWT" }

Claims:

{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }

The header says how to interpret the tail-end, which contains the digital signature. It is critical that we only allow an alg that matches our actual trust key, which is ES512. We should also only honor claims that include expiration timestamps, to limit the time that a leaked token can be used.
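A hedged sketch of that check using the Node jsonwebtoken package (an assumption for illustration; gm-data itself implements this in Go). Pinning algorithms to ES512 rejects tokens whose header names any other alg, and verification fails for expired tokens:

const jwt = require("jsonwebtoken");

// publicKeyPem: the trusted ES512 public key (JWT_PUB, base64-decoded back to PEM).
function verifyUserPolicy(token, publicKeyPem) {
  // Throws if the signature is invalid, the alg is not ES512, or exp has passed.
  return jwt.verify(token, publicKeyPem, { algorithms: ["ES512"] });
}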

UserPolicy Specific

If a user comes in with a statement such as the one above, then we can accept it as a true statement. There needs to be some specific structure in these in order to be useful. Presume that we design the Claims payload to specifically look like this:

{
  "exp": 43143143,
  "label": "asRobJohnsonWork",
  "values": {
    "org": [ "decipher", "ieee" ],
    "citizenship": [ "US" ],
    "email": [ "rob.johnson@email.com", "rob.johnson@another-email.com" ],
    "age": [ "adult" ],
    "clearance": [ "confidential" ]
  }
}

Some things about this document:

  • It has an expiration date, after which we should not honor the signature

  • The values are specific to an application family that wants to automatically process flexible policies.

    • It has a regular structure to it. map[string][]string is the regular structure for the values.

    • This regular structure helps a domain-specific language to be written into objectPolicy for evaluation later

ObjectPolicy Specific

When an object is created in the system, a policy is attached to it. The policy is authored in JSON (and literally stored as BSON). It is actually a function: the inputs are the claims of the input token, and the output is a Permission. Each of the sample objectPolicies below is attached to an object on each of its updates.

These statements are rendered in LISP syntax for legibility. The originalobjectpolicy field can be set to this LISP statement directly, and the server will set the objectpolicy field to the equivalent, but much more verbose, json value. The originalobjectpolicy field might also be set to something that is not our LISP language, such as a proprietary ACL language that gm-data does not understand directly.

This is how you would allow anonymous access: if the caller did not present us with a UserPolicy at all, give them R and X access (ie: read is for properties; execute is for streams and directory listings):

(yield R X)

Note that this LISP syntax statement, compiled to the json that is actually set in the objectpolicy field, is:

{
  "f": "yield",
  "a": [
    { "v": "R" },
    { "v": "X" }
  ]
}

In general, the json field "f" is the first argument in a parenthesis list (head of the list in LISP terminology), and "a" is the list of remaining arguments (tail of the list in LISP terminology). The actual string values are represented with "v". This means that the transform between LISP and actual objectpolicy is straightforward, and goes in both directions. Double-quotes can be used in values to include spaces and parenthesis in the actual value.
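To illustrate that the transform runs in both directions, here is a minimal sketch of rendering the f/a/v JSON back into LISP text (a hypothetical helper, not part of gm-data's API):

// Render an {f, a, v} policy node back to LISP syntax.
function toLisp(node) {
  if ("v" in node) {
    // Quote values containing spaces or parentheses, per the rule above.
    return /[\s()]/.test(node.v) ? `"${node.v}"` : node.v;
  }
  const args = (node.a || []).map(toLisp).join(" ");
  return args ? `(${node.f} ${args})` : `(${node.f})`;
}

toLisp({ f: "yield", a: [{ v: "R" }, { v: "X" }] }); // -> "(yield R X)"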

If we need audited-but-public access, then we can demand an email, just so that it shows up in audit logs:

(if (tells email) (yield R X) (if (has some email rob@example.com) (yield-all)))

This is a file that is owned by a particular email address (full access, including P for purge):

(if (contains email rob.johnson@email.com) (yield C R U D X P))

If this file is owned by a user and shared for read-access to a group:

(if (contains email rob.johnson@email.com)
  (yield C R U D X P)
  (if (contains org decipher)
    (yield R X)))

You have to be a non-dual citizen to use this resource (ie: US citizenship is required, but you must not also be a citizen somewhere else):

(if (and
      (contains citizenship US)
      (not (has not citizenship US)))
  (yield R X))

Everything in this system is an event such as a Create, Update, or a Delete. When updating or deleting an object, it will search for the latest version of the object to evaluate the userPolicy against. Similarly, for a Create, it will look up the latest version of the parent directory and check the permissions against that. These evaluators take UserPolicy in as input and return permissions as output. Each of these evaluators is attached to each version of a resource that is access controlled.

As a general pattern, expect that access controls whether a file is even known to exist, ownership determines who has edit access on a file, and read access may be granted otherwise. That means that this pattern should be common:

(if (COMPLICATED_ACCESS)
  (if (OWNERSHIP)
    (yield-all)
    (yield R X)
  )
)

Example of "US Adult citizens can read files and metadata", where the file is "jointly editable by two specific email addresses or anybody in a specific admin group". Anyone not meeting the access threshold will not even know that the file exists, since that would require R or X.

So, taking COMPLICATED_ACCESS to be:

(and (contains citizenship US) (contains age adult))

And OWNERSHIP to be:

(or (contains email rob.johnson@email.com danielle.miller@email.com) (contains group "deciphernow admin"))

We then would have a full originalobjectpolicy of:

(if (and (contains citizenship US) (contains age adult))
  (if (or (contains email rob.johnson@email.com danielle.miller@email.com) (contains group "deciphernow admin"))
    (yield-all)
    (yield R X)
  )
)

Which, when compiled to its equivalent json, becomes the requirements field of objectpolicy.

Combining The Concepts

UserPolicy values:

{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }

Access is allowed because we told our email:

(if (tells email) (yield R X) (if (contains role "administrator") (yield-all)))

And this object has full ownership by any bearer with an email equal to rob.johnson@email.com:

(if (contains email rob.johnson@email.com)
(yield C R U D X P)
(if (contains org decipher)
(yield R X)))

The ObjectPolicy Language

The current policy language has a minimum number of functions to demonstrate the capability. It will probably need some higher-level functions to compress common policy expressions.

  • (if $cond $trueBranch $falseBranch) : bool Based on boolean truth of $cond it will execute either $trueBranch or $falseBranch and not evaluate the other branch.

  • (and $arg+) : bool And against any number of arguments (usually two).

  • (or $args+) : bool Or against any number of arguments (usually two).

  • (not $args+) : bool Not against one argument.

  • (contains $field $arg+) : bool We presume that $field is a []string. It returns true if any arg matches. (note: we should probably have contains-and and contains-or, just to be less surprising)

  • (has $op $field $arg+) : bool We presume that $field is a []string. $op can be specified as either eq or not, so that we can express more than the equality that contains can do. Multiple args are allowed, ie: (has not employer snapchat zynga). (note: we probably want has-or and has-and to make it less ambiguous.)

  • (tells $field) : bool We demand that this field is told to us in UserPolicy values. This supports saying that we want an attribute to exist for auditing purposes. Example: (tells email)

  • (yield $args) : (bool, permissions-side-effect) If we encounter this in the tree, then we write each argument to the output list.

  • (allow-all) : (bool,permissions-side-effect) Is identical to yield-all

  • (allow-read) : (bool,permissions-side-effect) Is identical to yield R X

  • true : bool Is used in cases where we unconditionally yield permissions

  • false : bool Is used in cases where we document that this branch of the expression fails.

When a true branch is taken, the whole expression reduces to the true branch only, until a leaf is reached. The leaf must be a macro that eventually yields possible permissions.

Possible Permissions

Note that we NEVER mutate data in this design. We insert events that sort on (objectId, tstamp) so that we maintain full history. This is what allows this permission system to have a uniform structure to it.

  • "R" - Reading metadata about resources such as a file/dir, but not necessarily being able to retrieve files or list directories. Without "R" we are not even told that the file exists; it won't be in the listings.

  • "X" - Execute. This is not the same as Unix filesystem "execute". What we mean is listing directories (same as in Unix), and for opening the file stream (not same as Unix).

  • "C", "U", "D" - For the mutation operations we need to look up the latest version of what we are changing before we allow the change to be appended.

    • When performing one of these on a dir, we check for "C" on the latest version of the parent dir.

    • When performing update on a dir or file, we check for "U" against the latest version that has this objectId that we are updating.

  • "C" - Create. This is used on directories. When inserting a file, we check that we have C on the latest version of the parent directory. Creates in the parent will be allowed by UserPolicy that yields C.

  • "U" - Update. This is used on everything. Updates on this objectId will be allowed for UserPolicy that yields U.

  • "P" - Purge. This DOES alter the database. Records are automatically hidden when overridden by new versions, and when they expire. But they are physically in the database. For compliance situations, such as GDPR, we must be able to purge objects as well. This means periodically complying by physically removing items that are either expired, or that a user has demanded be removed.

The reason we never mutate data is that we need to be able to recreate a replica of the database from the Update logs alone. Purge is something of an exception, but it can be safely included as long as we:

  • Never recreate the same objectId more than once (in the update audit logs)

  • Hold a purge request and only execute it after we actually find what we are asked to purge. This way we don't lose purge requests on replication.

Also, internally, we take advantage of the immutability of data so that we can run queries on snapshots in time. As we are processing as of the current time Now(), more records are coming in with future timestamps on them.