Bulk inserts

Hi,

I’m working with an external booking system where, unfortunately, the only way to get updated availability information is to poll an endpoint periodically. That system is flaky and unreliable, so I want to use FaunaDB as the source for availability queries from my clients. My plan is to sync the data to a collection periodically: on every sync, I’ll wipe out the collection and recreate it, inserting 2000-3000 documents each time.

I was wondering if there is a bulk insert option instead of creating them one by one.

I’ve looked at the importer tool, but I’d like to avoid a Java runtime if I can. If that’s the only option, though, I might as well try it.

Thanks

Whatever the import tool does is probably also possible with pure FQL :slight_smile:.
A bulk insert would just be placing these create statements in a Do().
There is a request size limit to take into account, though (roughly 1 MB), so you might have to make multiple calls and group the creates in batches of 50/100/500, depending on how big your documents are.
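A rough sketch of that with the JavaScript driver (the `availability` collection name and the batch size of 100 are placeholders, not anything from your setup):

    const faunadb = require('faunadb')
    const q = faunadb.query

    const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET })

    // Split the polled documents into chunks that stay well under the request size limit.
    const chunk = (arr, size) =>
      Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
        arr.slice(i * size, (i + 1) * size)
      )

    async function bulkInsert(docs, batchSize = 100) {
      for (const batch of chunk(docs, batchSize)) {
        // One transaction per batch: every Create inside the Do() succeeds or fails together.
        await client.query(
          q.Do(...batch.map(doc => q.Create(q.Collection('availability'), { data: doc })))
        )
      }
    }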

How often are you polling?

Wiping the entire collection and then re-uploading can kill you in write-operation costs. But I suppose you’d have to work hard to hit the free tier limit of 50k write ops/day, especially if the DB is otherwise read-only.
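To put numbers on it: at 3,000 documents per sync, each wipe-and-reload costs on the order of 6,000 write ops (the deletes count as writes too), so a once-a-day sync sits comfortably under the limit, while, say, an hourly poll would already be 24 × 6,000 = 144,000 write ops/day.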

Overly-complicated alternative

If

  • If you neeeeeeeed to reduce write ops, or
  • if you have other collections that link to this data and you do not want to destroy the existing references

Then you can efficiently add/remove new/old items using the Difference function. Compare some unique identifier in your polled data against an index of that id in the Fauna data to determine which ids are new (or stale, for that matter).

Difference(Var('array_of_polled_ids'), Var('array_of_existing_ids'))

Reverse the arguments to get the ids that exist in Fauna but not in the polled data, i.e. what needs to be deleted.

Difference(Var('array_of_existing_ids'), Var('array_of_polled_ids'))

Use Intersection to update existing data, if the polled data contains changed documents rather than just new ones.

Intersection(Var('array_of_polled_ids'), Var('array_of_existing_ids'))

Obtaining array_of_existing_ids is one read op, and reading each of the Sets above costs just one more read op each.
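The delete side could look roughly like this (a sketch, using the same idsOfItems index as the example below, plus a hypothetical term index itemsById for looking a document up by its id):

    q.Lambda(
      ['polledIds'],
      q.Let(
        {
          // every id currently stored in Fauna, read in a single page
          existingIds: q.Select(
            'data',
            q.Paginate(q.Match(q.Index('idsOfItems')), { size: 100000 })
          ),
          // ids that exist in Fauna but are absent from the polled data
          staleIds: q.Difference(q.Var('existingIds'), q.Var('polledIds'))
        },
        // delete the document behind each stale id
        q.Map(q.Var('staleIds'), id =>
          q.Delete(q.Select('ref', q.Get(q.Match(q.Index('itemsById'), id))))
        )
      )
    )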

Overly-complicated Example

    q.CreateFunction({
      name: 'createUniqueItems',
      body: q.Query(
        q.Lambda(
          ['pollData'],
          q.Let(
            {
              // ids present in the freshly polled data
              pollDataIds: q.Map(q.Var('pollData'), item => q.Select('id', item)),
              // ids already stored in Fauna, read from the index in a single page
              existingIds: q.Select(
                'data',
                q.Paginate(q.Match(q.Index('idsOfItems')), { size: 100000 })
              ),
              // ids that appear in the polled data but not yet in Fauna
              newIds: q.Difference(q.Var('pollDataIds'), q.Var('existingIds')),
              // keep only the polled documents whose id is new
              newData: q.Filter(q.Var('pollData'), item =>
                q.Any(
                  q.Map(q.Var('newIds'), id =>
                    q.Equals(id, q.Select('id', item))
                  )
                )
              ),
              // create a document for each genuinely new item
              newItems: q.Map(q.Var('newData'), itemData =>
                q.Create(q.Collection('Item'), { data: itemData })
              )
            },
            q.Var('newItems')
          )
        )
      )
    })
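Calling the UDF from the JavaScript driver is then a one-liner (a sketch; pollData is whatever array your polling job produced, passed inside an array because the Lambda’s signature is ['pollData']):

    const newItems = await client.query(
      q.Call(q.Function('createUniqueItems'), [pollData])
    )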

Cheers! :nerd_face:

Thanks @databrecht

Do you suggest Do() for performance or for consistency? In my case, neither is important. Are there any other advantages to using Do() instead of sequential creates?

Thanks @ptpaterson

I’ll poll once a day, but I might need a variation of your “Overly-complicated Example”: I’ve noticed that my original estimate of 2000-3000 docs was a mistake. It’s a lot more than that, so I’ll probably need some logic in my updates.

Difference() and Intersection() functions were not on my radar so thanks for bringing them up.

In that case you would probably be fine with either sequential creates or Do(), or with parallel creates if you are able to and want to eliminate some network latency. If consistency of a certain chunk of data had been important, I would have told you to use Do(). But, of course, there is a limit to how much you can do in one transaction.
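A sketch of that parallel variant, reusing the client and collection from the earlier snippet:

    // Each Create runs as its own transaction, so there is no all-or-nothing
    // guarantee across documents, but the requests overlap on the network.
    await Promise.all(
      docs.map(doc =>
        client.query(q.Create(q.Collection('availability'), { data: doc }))
      )
    )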
