Bulk inserts

fred · July 24, 2020, 4:22pm

Hi,

I’m working with an external booking system, where, unfortunately, the only way to update availability information is to poll periodically an endpoint. This system is flaky and unreliable and I want to use FaunaDB as the source for availability queries from my clients. So I’m thinking of periodically sync this data to a collection. On every request, I’ll wipe out the collection and recreate it each time 2000-3000 documents.

I was wondering if there is a bulk insert option instead of creating them one by one.

I’ve looked at the importer tool and if I can, I want to avoid a java runtime but if that’s the only option, I might as well try that.

Thanks

databrecht · July 24, 2020, 4:57pm

Whatever the import tool does is probably also possible with pure FQL .
Bulk imports would just be placing these create statements in a Do().
There is the request limit to take into account though (± 1MB) so you might have to do multiple calls and maybe group them per 50/100/500 depending on how big your documents are.

ptpaterson · July 25, 2020, 1:46am

How often are you polling?

Wiping the entire collection, then re uploading can kill you in the cost of write operations. But I suppose you’d have to work hard to hit the free tier limit of 50k write ops/day, especially if the DB is otherwise read only.

Overly-complicated alternative

If

If you neeeeeeeed to reduce write ops, or
if you have other collections that link to this data and you do not want to destroy the existing references

Then you can efficiently add/remove new/old items using the Difference function. This can compare some unique identifier in your polled data to an index of that id in the fauna data, which can determine which ids are new (or old for that matter).

Difference(Var('array_of_polled_ids'), Var('array_of_existing_ids'))

Reverse the arguments to get ids that exist in fauna but not in the polled data, to see what needs to be deleted.

Difference(Var('array_of_existing_ids'), Var('array_of_polled_ids'))

User Intersection to update existing data, if the polled data contains changed documents, rather than just new ones

Intersection(Var('array_of_polled_ids'), Var('array_of_existing_ids'))

Obtaining the array_of_existing_ids is one read op, and reading each of the above Sets is also just one more read op each.

Overly-complicated Example

    q.CreateFunction({
      name: 'createUniqueItems',
      body: q.Query(
        q.Lambda(
          ['pollData'],
          q.Let(
            {
              pollDataIds: q.Map(q.Var('pollData'), item => q.Select('id', item)),
              existingIds: q.Select(
                'data',
                q.Paginate(
                  q.Match(q.Index('idsOfItems'))),
                  {
                    size: 100000
                  }
                )
              ),
              newIds: q.Difference(q.Var('pollDataIds'), q.Var('existingIds')),
              newData: q.Filter(q.Var('pollData'), item =>
                q.Any(
                  q.Map(q.Var('newIds'), id =>
                    q.Equals(id, q.Select('id', item))
                  )
                )
              ),
              newItems: q.Map(q.Var('newData'), itemData => q.Create(q.Collection('Item'), itemData)
            },
            q.Var('newItems')
          )
        )
      )
    })
  })

Cheers!

fred · July 25, 2020, 8:09am

Thanks @databrecht

Do you suggest Do() for performance or consistency? In my case, neither of them is important. Any other advantages of using Do() instead of sequential creates?

fred · July 25, 2020, 8:15am

Thanks @ptpaterson

I’ll poll once a day but I might need a variation of your “Overly-complicated Example” as I’ve noticed I’ve made a mistake in my original estimation of 2000-3000 docs. It’s a lot more than that. So I’ll probably need some logic in my updates.

Difference() and Intersection() functions were not on my radar so thanks for bringing them up.

databrecht · July 27, 2020, 12:06pm

In that case you would probably be fine with both sequential creates as well as Do(). Or parallel creates if you are able to and want to eliminate some network latency. If consistency of a certain chunk of data would have been important I would have told you to use Do(). But, of course, there is a limit to how much you can do in one transaction.

Topic		Replies	Views
How to bulk insert test data in Fauna DB? Help graphql , fql , best-practices	2	836	October 19, 2021
Bulk import? Help graphql	6	857	June 25, 2021
Multi-document Upsert Help fql , javascript	3	510	November 19, 2020
"Upsert" FQL example not working Help fql	5	541	March 20, 2022
Real Time CRUD operations Help fql , real-time	4	353	January 7, 2021

Bulk inserts

Overly-complicated alternative

Overly-complicated Example

Related topics