Export big collections via CDC or dump to S3 or Snowflake

We are considering moving a large portion of our OLTP traffic from Postgres to Fauna. Fauna’s seamless scale is very appealing to us since we expect a large spike in our user base, and we already find ourselves fixing memory and CPU issues in Postgres.

The question is: as we accumulate a large amount of data in Fauna, what are our options for exporting a large collection to an external system? For example, Snowflake for BI queries, S3 for long-term storage, Druid for real-time OLAP queries, or a graph database like Neo4j.

When exporting data, I think of two methods:

  1. CDC, meaning capturing changes in real time. For example, supporting the projection of changes to a Kafka topic.

  2. Dump, meaning downloading big chunks of data in one operation. For example, exporting a collection to one (or several) large JSON files in S3.

Does Fauna have something along those lines planned for the near future?

Not knowing whether we will have an export strategy in case we need one might be a deal breaker for us.

1 Like

CDC is being worked on. I can’t commit to a timeframe; that’s outside my wheelhouse.

2 Likes

Is using the temporal aspect of FaunaDB to monitor changes an option for you in the meantime?
It’s relatively easy (and efficient compared to dumps) to say “last sync of collection X was at timestamp Y, give me all created/updated/deleted documents”, thanks to the temporal features.

At this point it’s slightly inconvenient: you’ll have to combine two calls.

Get all events of a collection after a timestamp

Paginate(Documents(Collection("fweets")), {
  events: true,
  after: Time("2020-05-22T19:12:07.121247Z")
})

This will give you create/delete events after a timestamp in an efficient way. Events on a collection only give you creates and deletes, so you are obviously missing updates there, which is a bit cumbersome at the moment. So we’ll combine it with the next step.

Get all documents with timestamps after a given timestamp

You can, however, create an index on ‘ts’ and do the following.
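
Such an index needs to expose each document’s ts and then its ref as values so they can be bound in the Lambda below. A minimal sketch of the index definition, assuming the JavaScript driver and the index name fweets_after_ts used in the query that follows:

// Assumes the JavaScript driver, e.g.
// const { CreateIndex, Collection } = require("faunadb").query
CreateIndex({
  name: "fweets_after_ts",
  source: Collection("fweets"),
  // Expose the document timestamp first (so Range can filter on it), then the ref.
  values: [{ field: ["ts"] }, { field: ["ref"] }]
})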

Map(
  Paginate(
    Range(
      Match(Index("fweets_after_ts")),
      ToMicros(Time("2020-05-22T19:12:07.121247Z")),
      null
    )
  ),
  Lambda(
    ["ts", "ref"],
    Paginate(Var("ref"), { events: true, after: Time("2020-05-22T19:12:07.121247Z") })
  )
)

This will give you all documents that were created or changed after the given timestamp (create/update) and immediately fetches the events of each of those documents after the same timestamp (create/update/delete).

Combine and conquer

FQL allows you to easily combine queries, so you could run those two in one transaction and merge the results into a clean create/update/delete feed. I really need to write an example for that, but as far as I know an external writer is working on one. In theory, that should give you access to a cheap and very robust (fault-tolerant) pull-based sync system. Maybe that helps until we have something like CDC?
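
In the meantime, here is a minimal sketch of that combination, assuming the JavaScript driver, the fweets_after_ts index from above, and a since timestamp kept by the consumer (an illustration only, not an official recipe):

// Assumes:
// const faunadb = require("faunadb")
// const { Let, Var, Paginate, Documents, Collection, Map, Lambda,
//         Range, Match, Index, Time, ToMicros } = faunadb.query
// const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET })
const since = Time("2020-05-22T19:12:07.121247Z")

client.query(
  Let(
    {
      // create/delete events on the collection itself
      collectionEvents: Paginate(Documents(Collection("fweets")), {
        events: true,
        after: since
      }),
      // create/update/delete events per document touched after the timestamp
      documentEvents: Map(
        Paginate(
          Range(Match(Index("fweets_after_ts")), ToMicros(since), null)
        ),
        Lambda(
          ["ts", "ref"],
          Paginate(Var("ref"), { events: true, after: since })
        )
      )
    },
    { collectionEvents: Var("collectionEvents"), documentEvents: Var("documentEvents") }
  )
)

Both result sets come back from a single transaction, so the consumer can merge them into an ordered create/update/delete feed and push it to a sink such as Kafka, S3, or Rockset.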

1 Like

Thanks for the detailed answer! This pull-based method could definitely work for us in the meantime (a more complete example would be of great value for us and others facing the same problem).

The ideal API for this would probably be something fully featured like the Change Streams API introduced in Mongo 3.6.

Just to give some more context, we are using Rockset with great success (it’s really kind of black magic to me :slightly_smiling_face:) for building real-time leaderboards and dashboards. We want to stream changes to Rockset as they occur so they can be queried via the dashboards and leaderboards using OLAP-style queries (sums, avgs, distincts, limits…).

Currently we are doing this with “dual writes”: the backend code (Node) simply makes two calls, one to Postgres and one to Rockset. Being a real-time, ingest-first solution, Rockset can integrate with Kafka in a few dashboard clicks and reflect the changes. This would eliminate the “dual writes”.
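
For context, a stripped-down sketch of that dual-write path in Node (the table name, payload shape, and the writeToRockset helper are all hypothetical stand-ins; the real second call would go through Rockset's SDK or write API):

// Assumes: const { Pool } = require("pg"); const pool = new Pool()
// writeToRockset(collection, doc) is a hypothetical wrapper around Rockset's ingest API.
async function recordScore(score) {
  // Write #1: the OLTP source of truth (Postgres today, possibly Fauna tomorrow).
  await pool.query(
    "INSERT INTO scores (user_id, points) VALUES ($1, $2)",
    [score.userId, score.points]
  )
  // Write #2: the OLAP copy used by leaderboards/dashboards. If this call fails,
  // the two stores silently drift apart, which is the problem a CDC/streaming feed would remove.
  await writeToRockset("scores", score)
}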

Just a couple of months ago, Rockset also released support for seamless integration with MongoDB Atlas Change Streams. I think that enabling this kind of seamless integration with Fauna could really be a “match made in heaven”. Both services are “true” serverless, no-knobs, disruptive databases that complement each other perfectly: Fauna handles OLTP and Rockset handles OLAP.

Rockset sounds super interesting, especially for someone like me who, until a few months ago, was still a full-stack dev at a company that developed a Belgian alternative to dashboarding tools like Tableau/QlikView :slight_smile:

An integration between something like FaunaDB for OLTP and Rockset for OLAP would indeed be golden. I’ll definitely deliver this feedback to our team and might take a look at how hard it would be to implement.

Dual writes are indeed not ideal; I assume that opens you up to a world of hurt when one of those writes starts to fail now and then.

1 Like

FYI, an external writer (who is really good) is writing about a streaming solution on top of FaunaDB as we speak. He got delayed, but I quickly asked how far he got with that approach; he might already have a concrete example. If not, I’ll try to see whether I can find time to write something more complete with an example collection.

1 Like

A little bit off topic, but do you know of any startups that are currently using Fauna? I am specifically interested in startups that use cloud Fauna (not on-premises) and have limited engineering resources. Since Fauna is so cutting edge, I am having a hard time finding “hands-on” and “lessons learned” information. Pitching it to stakeholders is a challenge for me.

There are a few founders lurking around here on the forum; I would advise you to make a separate thread, I’m sure they’ll respond :slight_smile:
ShiftX is one off the top of my head (https://shiftx.com/)

@Taro you mentioned you were interested in the use-case of people who are looking for streaming.
Above is one specific use-case (feeding data into another system like Rockset in a sustainable and robust way). If you already have something concrete in terms of code or experience that could help them, let us know :slight_smile:

2 Likes

I appreciate the kind words @databrecht :slight_smile:

@yaron, as Brecht mentioned, I’m currently writing an article relevant to what you seek. So far I’ve published an NPM package for polling FaunaDB. It’s still a bit in its infancy (it won’t be after next week), but it should give you a good idea of what’s possible with FaunaDB’s temporal features, as mentioned in an earlier comment. I have a serverless architecture in mind for performing robust ETL/CDC from FaunaDB to a destination like Kafka, S3, etc. It’s a bit too lengthy to include in the article I’m writing right now; however, I do plan on publishing/documenting a proof of concept after wrapping this article up. In the meantime, I’d be more than game to hop on a call with you sometime next week to answer any questions and help you address your requirements.

(Future readers, I promise I’ll return to this thread with more details regarding serverless ETL/CDC)

@Taro I’m super happy to hear that things are moving forward toward that Fauna-to-Kafka/S3 goal. It will make Fauna such a complete package. I’ll see if I can arrange some time next week for a call; if I can’t find time, please allow me to keep that option open for the future :slightly_smiling_face:

1 Like

You might also want to check out Pipedream!

1 Like

@ptpaterson Pipedream looks great! Does it support updates to documents as well (not just adds and deletes)?

Oh! That is a good (and important) question. Sorry, I don’t know. I have not used it; I just remember taking a quick look when it was posted to Slack.

1 Like

The Pipedream connector is open source, so you can check that out :slight_smile:

So I don’t think it does (please correct me if I’m wrong, in case anyone from Pipedream is reading this). I provided them with the details on how to improve it (the same approach I explained above), iirc, but I’m not sure whether they want to change it; maybe updates are not desired in Pipedream, I don’t know.

In case it changes, this is the link:

3 Likes

Thank you all for introducing Pipedream. Looks like a great tool.

3 Likes