Can I update all documents in a collection with only one transaction?

Problem

I can’t figure out how to update all documents in a collection with only one transaction. I want the performance benefits of pagination. The collection may contain more than 100,000 (max page size) documents.

Is there an idiomatic example of single-transaction bulk update?

Assumptions

  • the (JavaScript) driver’s PageHelper calls Client.query for each invocation of its callback function (using cursor metadata as needed)
  • each call to Client.query runs in a separate transaction
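
For reference, a minimal sketch of what I mean by those assumptions, using the JavaScript driver (the index name and secret are placeholders):

const faunadb = require('faunadb')
const q = faunadb.query
const client = new faunadb.Client({ secret: 'YOUR_SECRET' })

// PageHelper fetches one page per Client.query call, i.e. one read
// transaction per page, carrying the cursor metadata between calls.
client
  .paginate(q.Match(q.Index('all_orders')), { size: 1000 })
  .each((page) => {
    console.log('got a page of', page.length, 'refs')
  })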

Background

FaunaDB might be a good choice for my next project, so I’m working through the FaunaDB documentation using the JavaScript driver and FQL. So far I’ve:

  • Skimmed all newsletters
  • Read available Fauna blog posts chronologically from the first, “Welcome to the Jungle”, to mid-2018’s “Verifying Transactional Consistency with Jepsen”
  • Progressed through the official documentation from Welcome! to the Index tutorials

At this point I can do many of the things I want to do with FaunaDB, but I feel as though I must be missing something important when it comes to large collections.


I am also trying to do something similar…

I am not 100% certain that my answer is the correct one, but I’ll try to provide you with some guidance.

Data Manager

Before you read on: the data manager (FDM) might help.

Can you do bulk updates in one transaction (+100k)?

No, I don’t think you can do that in one transaction, since that would basically be a bad idea. Pagination is mandatory in FaunaDB, just like indexes, because (I believe) we want:

  • to provide a scalable model
  • to prevent users from shooting themselves in the foot with a huge query that is accidentally slow and costly.

However! You might not need it.

Could you elaborate on your use case? I assume that you want to do some kind of migration and want to make sure you get all of your documents migrated (new ones might arrive in the meantime). If that’s correct, there is another way you can do it:

  • Start your migration and keep the timestamp of your transaction. If I’m not mistaken, pagination will continue to work on the snapshot (you might want to verify this) that existed when you started the first transaction, so you can safely go through all the pages. Of course, other documents might have arrived in the meantime that are not migrated. In that case I would exploit temporality and ask FaunaDB: “what has changed in the meantime?”, then run your migration on those changes until you are certain that everything is migrated and your new code is released.
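
To make that concrete, here is a rough sketch with the JavaScript driver of what I have in mind (the index name and secret are placeholders, and you should verify the snapshot/events behaviour yourself):

const faunadb = require('faunadb')
const q = faunadb.query
const client = new faunadb.Client({ secret: 'YOUR_SECRET' })

// Record the moment the migration starts (Fauna timestamps are in microseconds).
const startTime = Date.now() * 1000

// 1. Page through the set as it existed at startTime; later writes
//    should not show up inside this At() snapshot.
client.query(
  q.At(
    startTime,
    q.Paginate(q.Match(q.Index('orders_to_migrate')), { size: 1000 })
  )
)

// 2. Once done, ask "what has changed in the meantime?" via the set's
//    event history, and migrate only those documents.
client.query(
  q.Paginate(q.Match(q.Index('orders_to_migrate')), {
    events: true,
    after: startTime,
  })
)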

Of course, you might require all of that to be done in one step:

  • Do migration
  • Release new code

However (and again, these are my thoughts, I might be wrong), doing that while your application keeps running might be a bad idea. If you have a huge number of documents and we allowed you to run one huge migration at once while your application is still running and adding other write transactions, then those write transactions would conflict with your migration, and all of them (or your migration) would have to be retried.

If you were in such a case I would advise you to either take the pagination approach (if it’s feasible, e.g. if your new code won’t crash on the migrated documents) and/or pause writes to your application until the whole migration is done. It’s a problem that arises in every approach. The best approach is often to make sure that old code won’t break after a migration, but that is of course not always possible. In FaunaDB, however, you can exploit the temporality model if you want to keep your app online while you do such a migration.

Disclaimer

There might be better ways; I don’t know everything about FaunaDB yet :slight_smile:


@databrecht is mostly on the money here, but to provide a really concrete answer: because Fauna does transactions with a single round of consensus, each data node has to apply transactions in a deterministic order. While we can check dependencies within a batch (and thus process transactions in parallel), we can’t check dependencies across batches, which means an entire batch must be processed before moving on to the next, and we can’t advance the node’s transaction clock in the meantime (in the future this will be more fine-grained). In short: one big write can block all others from progressing until it has completed. In general, even if we made the locking more fine-grained, it would still block all other writes for that collection. Fauna is designed for unbounded volumes of data, and so “update all the records transactionally” isn’t really feasible. It’s not really feasible in a regular SQL database either, and many an outage has been caused by an update taking a table lock for a few hours.

A better solution for this kind of thing is to attach metadata to each document that tracks its schema ID, and to access the documents through a UDF. You can update the UDF along with the schema to migrate documents in place as they are accessed. You can also kick off a background job to traverse the collection a page at a time. Once the migration is complete you can update the UDF to revert to a straight Get. In this way you always keep access to your data no matter how large the collection grows. It’s more involved, and we should probably make some cookbook code for it, but there are only so many hours in the day :slight_smile:
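
Roughly, a sketch of that shape with the JavaScript driver (names like getOrder and schemaVersion are placeholders, and the per-version transformation is up to you):

const faunadb = require('faunadb')
const q = faunadb.query
const client = new faunadb.Client({ secret: 'YOUR_SECRET' })

// UDF that returns an order, migrating it in place on first access if it
// still carries the old schema version.
client.query(
  q.CreateFunction({
    name: 'getOrder',
    body: q.Query(
      q.Lambda(
        'ref',
        q.Let(
          { doc: q.Get(q.Var('ref')) },
          q.If(
            // Documents written before the migration default to version 1.
            q.Equals(q.Select(['data', 'schemaVersion'], q.Var('doc'), 1), 2),
            q.Var('doc'),
            // Old schema: apply the (placeholder) transformation and stamp
            // the new version; Update returns the migrated document.
            q.Update(q.Var('ref'), { data: { schemaVersion: 2 } })
          )
        )
      )
    )
  })
)

// Clients then read through the UDF instead of a straight Get:
//   client.query(q.Call(q.Function('getOrder'), someRef))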


Here is an example where I am doing a bulk update in FQL. The equivalent SQL is: UPDATE orders SET shipVia = 2 WHERE customerID = 'SPLIR'.

// Sample Order Document

{
  "orderID": 10349,
  "customerID": "SPLIR",
  "employeeID": 7,
  "orderDate": "1996-11-08 00:00:00.000",
  "requiredDate": "1996-12-06 00:00:00.000",
  "shippedDate": "1996-11-15 00:00:00.000",
  "shipVia": 1,
  "freight": 8.63,
  "shipName": "Split Rail Beer & Ale",
  "shipAddress": {
    "street": "P.O. Box 555",
    "city": "Lander",
    "region": "WY",
    "postalCode": 82520,
    "country": "USA"
  },
  "details": [
    {
      "productID": 54,
      "unitPrice": 5.9,
      "quantity": 24,
      "discount": 0
    }
  ]
}

// Get all orders for a customerID. If no values are specified in Index, ref is returned by default.

CreateIndex({
  name: "orders_by_customerID",
  source: Collection("orders"),
  terms: [{ field: ["data", "customerID"] }]
});

// Paginate through the above index. My example has only a few documents with customerID = "SPLIR"

> Paginate(Match(Index("orders_by_customerID"), "SPLIR"), { size: 10000 })
{
  data: [
    Ref(Collection("orders"), "263439422384510483"),
    Ref(Collection("orders"), "263439422385557011"),
    Ref(Collection("orders"), "263439422780872211"),
    Ref(Collection("orders"), "263439422787172883"),
    Ref(Collection("orders"), "263439422788211219"),
    Ref(Collection("orders"), "263439422788214291"),
    Ref(Collection("orders"), "263439423055627796"),
    Ref(Collection("orders"), "263439423056661012"),
    Ref(Collection("orders"), "263439423059794452")
  ]
}
>

// Bulk update shipVia for all orders with customerID = "SPLIR"

Map(
  Paginate(Match(Index("orders_by_customerID"), "SPLIR"), { size: 100000 }),
  Lambda(
    "x",
    Let(
      { NewshipVia: 2 },
      Update(Var("x"), { data: { shipVia: Var("NewshipVia") } })
    )
  )
)
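
If the matching set could exceed the 100,000 maximum page size, the same update can be driven page by page from the JavaScript driver, with each Map/Update call running as its own transaction. A rough sketch (the secret and page size are placeholders):

const faunadb = require('faunadb')
const q = faunadb.query
const client = new faunadb.Client({ secret: 'YOUR_SECRET' })

async function bulkUpdateShipVia() {
  let after // undefined for the first page
  do {
    const page = await client.query(
      q.Map(
        q.Paginate(
          q.Match(q.Index('orders_by_customerID'), 'SPLIR'),
          after ? { size: 1000, after } : { size: 1000 }
        ),
        q.Lambda('ref', q.Update(q.Var('ref'), { data: { shipVia: 2 } }))
      )
    )
    // Map preserves the page cursor, so `after` is set while more pages remain.
    after = page.after
  } while (after)
}

bulkUpdateShipVia()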

Thanks ben and @databrecht. The metadata+UDF approach to collection-wide updates seems reasonable.

I confess I don’t have much experience working with databases that support flexible documents. And I have seen table-locking schema migration cause problems in regular SQL databases. In general, though, I’ve had the luxury of working on projects that could perform schema migration offline.

As I read about FaunaDB I often wonder how I would do something I’ve done before with regular SQL databases. I should give more thought to whether the things I want to do make sense given:

  • FaunaDB design
  • no legacy design constraints
  • no legacy operational/procedural constraints
  • the desire to offer highly available applications
  • not just server clients, but also many ABAC users

I’ll continue working through FaunaDB documentation, and I’ll take a look at the data manager (FDM).
