Do and creation of schema documents

I’m trying to perform a query using the Do expression, but it doesn’t seem to be working as expected. I’m trying to first create a collection, and then create an index whose source is that collection.

The python code doing this is below:

def create_migrations_collection_and_indexes(fauna_client):
    """Create the required migrations collection and index if they do not already exist."""
    # This contains the query used to check if the migrations collection already exists,
    # and create it if it doesn't.
    create_migrations_collection = query.if_(
        query.exists(query.collection("migrations")),
        None,
        query.create_collection({"name": "migrations", "history_days": 0}),
    )

    # This contains the query used to check if the unique constraint on migrations
    # already exists, and create it if it doesn't.
    create_migrations_unique_index = query.if_(
        query.exists(query.index("migrations_unique_name_and_namespace")),
        None,
        query.create_index(
            {
                "name": "migrations_unique_name_and_namespace",
                "source": query.collection("migrations"),
                "terms": [
                    {"field": ["data", "name"]},
                    {"field": ["data", "namespace"]},
                ],
                "unique": True,
            }
        ),
    )

    logger.debug(
        "Attempting to create the migrations collection and"
        " migrations_unique_name_and_namespace index."
    )
    fauna_client.query(
        query.do(create_migrations_collection, create_migrations_unique_index)
    )
    logger.debug(
        "Successfully created the migrations collection and"
        " migrations_unique_name_and_namespace index."
    )

If I try to run this function, I get the following error:

faunadb.errors.BadRequest: ErrorData(code='validation failed', description='document data is not valid.', position=['do', 1, 'else'], failures=[Failure(code='invalid reference', description='Cannot read reference.', field=['do', '1', 'else', 'source'])])

This seems to indicate the reference to the created collection is not valid. If I create the collection before running this code, everything works fine.

Looking at the Do documentation it states:

The Do function evaluates a list of expressions which are provided as arguments. This evaluation occurs sequentially, from left to right, ensuring that modifications made by earlier expressions are seen by later expressions.

However, based on the error it seems this is not the case. What am I doing wrong?

I simplified the query and tested it in the web shell to confirm the If statements were not causing issues:

Do(
  CreateCollection({name: "migrations", history_days: 0}),
  CreateIndex({
    name: "migrations_unique_name_and_namespace",
    source: Collection("migrations"),
    terms: [{field: ["data", "name"]}, {field: ["data", "namespace"]}],
    unique: true
  })
)

The result was the same

[{"query":"Do(
  CreateCollection({name: "migrations", history_days: 0}),
  CreateIndex({
    name: "migrations_unique_name_and_namespace",
    source: Collection("migrations"),
    terms: [{field: ["data", "name"]}, {field: ["data", "namespace"]}],
    unique: true
  })
)
","error":"[
  {
    "position": [
      "do",
      1
    ],
    "code": "validation failed",
    "description": "document data is not valid.",
    "failures": [
      {
        "field": [
          "do",
          "1",
          "source"
        ],
        "code": "invalid reference",
        "description": "Cannot read reference."
      }
    ]
  }
]"}]

During a Fauna transaction, writes are applied at the “end” of the transactional processing when all of the query logic has been evaluated. Despite the description of the Do function behavior, which is correct, when your query attempts to create an index on the collection, the collection itself has not been written yet, so the Collection("migration") reference is not yet valid and the query fails.

There is a way forward though. When you call CreateCollection, or any of the Create* functions, the document created is returned. This is so that you can use the result to coordinate other “dependent” writes. Provided you capture the return value and extract the reference properly, you can definitely create a collection and an index that depends on it in a single query.

The query you write could look like this:

Let(
  {
    collection: If(
      Exists(Collection("migrations")),
      Get(Collection("migrations")),
      CreateCollection({ name: "migrations", history_days: 0 })
    )
  },
  If(
    Exists(Index("migrations_unique_name_and_namespace")),
    Get(Index("migrations_unique_name_and_namespace")),
    CreateIndex({
      name: "migrations_unique_name_and_namespace",
      source: Select("ref", Var("collection")),
      terms: [
        {"field": ["data", "name"]},
        {"field": ["data", "namespace"]},
      ],
      unique: true,       
    })
  )
)

That query captures the intermediate result of the collection creation (or fetches the existing collection), and then uses that result’s reference in the index creation.

Keep in mind that you cannot create a collection, index, function, or any other “schema” document and make use of it in the same transaction. For example, no creating documents in a collection created in the same transaction.

The recommended practice, when creating a test database, is to create all of the “schema” documents in one query, and all of the user-created documents in one or more subsequent transactions. For many test workflows, two queries should be enough to create hundreds of collections, indexes, and UDFs, plus thousands of documents.

Despite the description of the Do function behavior, which is correct, when your query attempts to create an index on the collection, the collection itself has not been written yet

I’m trying to reconcile this statement with:

This evaluation occurs sequentially, from left to right, ensuring that modifications made by earlier expressions are seen by later expressions.

Is a “write” not a “modification”? Is this scenario only specific to “schema” documents and not “user” documents? It seems to me to contradict the idea that modifications made by earlier statements are visible to later ones within the Do expression.

A write is definitely a modification.

During query processing, Fauna stages writes. When the query processor needs to read a document, it consults the staged writes first, and only when the required document is not in the stage does it fetch the document from storage. That’s how Do’s expressions can see the results of preceding expressions.

The problem with your query is not how Do is processed, but it is the result of two things:

  • When an index is to be created, the source collection(s) must already exist.

  • When you use Collection("migrations"), that forms a reference for a collection by name. Internally, a lookup must be made from the name to the system identifier used for the collection. Since that identifier had not yet been written, the existence constraint was violated.

My query works because it effectively tells the query processor which exact reference to use rather than looking one up. Before the collection creation has been written, its reference is not identical to Collection("migrations") even though it is after the collection has been written.

During query processing, Fauna stages writes. When the query processor needs to read a document, it consults the staged writes first, and only when the required document is not in the stage does it fetch the document from storage. That’s how Do’s expressions can see the results of preceding expressions.

Sorry, I’m still trying to understand this: Are you saying that inside of a Do that writes are not committed until the entire Do completes, but it is smart enough to look at prior writes within the Do to determine if the document is actually stored vs staged to be stored by the Do? That’s how I read the “…required document is not in the stage…” part, and this is how I would expect it to work.

The problem with your query is not how Do is processed, but it is the result of two things:

  • When an index is to be created, the source collection(s) must already exist.
  • When you use Collection(“migrations”), that forms a reference for a collection by name. Internally, a lookup must be made from the name to the system identifier used for the collection. Since that identifier had not yet been written, the existence constraint was violated.

My query works because it effectively tells the query processor which exact reference to use rather than looking one up. Before the collection creation has been written, its reference is not identical to Collection(“migrations”) even though it is after the collection has been written.

Based on my (possibly incorrect) understanding of the preceding paragraph, this still sounds like a bug to me. If the creation of the collection is a staged write, why would the query processor not look at the staged writes when evaluating Collection("migrations")?

Are you saying that inside of a Do that writes are not committed until the entire Do completes, but it is smart enough to look at prior writes within the Do to determine if the document is actually stored vs staged to be stored by the Do?

Yes. All writes are performed at the end of a transaction. And it’s not just Do. Any expression that needs to read a value that is staged for writing refers to the stage first. Staging writes is one of the strategies that make Fauna transactional. If any of the logic before the writes fails, or if any of the writes fail, the effects are not persisted and the transaction fails. Only when the logic+writes complete do the effects get persisted.

why would the query processor not look at the staged writes when evaluating Collection("migrations")

I didn’t say that it didn’t. I did say:

Before the collection creation has been written, its reference is not identical to Collection(“migrations”) even though it is after the collection has been written.

As I mentioned, using Collection("migrations") forms a reference by name. Internally, a collection reference uses a system identifier to disambiguate collections with the same name in different databases. That system identifier happens to be different for staged collection creations than for written collections. The before/after difference is what prevents your original query from working as you expect.

It might seem straightforward to modify query processing to only check for a collection’s name and so the staged writes lookup could provide the “obvious” answer. However, indexes provide the consistency required to make specific kinds of Fauna queries transactional. Simplifying the collection reference lookup to only be name-based would provide ambiguity to several portions of the internal query processing, and many kinds of transactions would no longer be consistent. Losing consistency would be a far bigger problem than how you have to construct a query to create a collection and a covering index in one transaction.

As I mentioned, using Collection(“migrations”) forms a reference by name. Internally, a collection reference uses a system identifier to disambiguate collections with the same name in different databases. That system identifier happens to be different for staged collection creations than for written collections. The before/after difference is what prevents your original query from working as you expect.

Ok, so if I understand correctly this behavior will occur with any built in function that turns a name into a reference in the context of Do? I believe most singularly named functions that allow referencing schema documents by name do this. What about functions like Call that can optionally take a name to look up a reference instead of a reference?

Ok, so if I understand correctly this behavior will occur with any built in function that turns a name into a reference in the context of Do?

Yes.

What about functions like Call that can optionally take a name to look up a reference instead of a reference?

The same name-based reference lookup issue exists:

> Do(
  CreateFunction({
    name: "doubleit",
    body: Query(
      Lambda("x", Add(Var("x"), Var("x")))
    ),
  }),
  Call(Function("doubleit"), 16.7)
)
Error: invalid ref
{
  errors: [
    {
      position: [
        'do',
        1,
        'call'
      ],
      code: 'invalid ref',
      description: "Ref refers to undefined function 'doubleit'"
    }
  ]
}

The general rule is: you cannot create a schema document and use it in the same transaction. So create all of your schema documents in one transaction, and then start using them in subsequent transactions.

The general rule is: you cannot create a schema document and use it in the same transaction. So create all of your schema documents in one transaction, and then start using them in subsequent transactions.

OK so this is the caveat I was looking for. I did some testing with match and other query patterns to confirm that creating and reading documents in the same transaction works as I would expect. But it sounds like this confirms that schema documents are special in that they are generally unable to be created and referenced in the same transaction. Is this a correct understanding?

But it sounds like this confirms that schema documents are special in that they are generally unable to be created and referenced in the same transaction. Is this a correct understanding?

Yes.

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.