Looking for advice on a schema

Currently exploring Fauna DB as a possible backend for a new SaaS, and we were hoping to better understand a possible schema we might use. This is a B2C SaaS. The primary user is an individual who signs up for a monthly plan.

In the most primitive way, the service allows a user to manage “lists”, where each “list” contains an arbitrary number of records. Because the fields of a record are user-defined, an unstructured datastore like FaunaDB is appealing. There’s nothing particularly novel about this. It’s your basic “spreadsheet” where each “list” is a spreadsheet with arbitrary columns and each row is a record.

In Fauna DB, we see “Lists” as being one Collection and “Records” being another Collection. Every list document and every record document would need some sort of user_id field.

With respect to FaunaDB, we have the following questions:

  • Collections seem very lightweight. Is there any advantage to creating a List Collection for every user? (i.e., one List Document that describes all of a user’s lists?)

  • Alternatively, could you make a Collection for each user’s lists? Since Collection names are unique, we’d need to create unique names, but presumably we could just do something like “user_id.list_name”. If you had thousands of users with dozens of lists each, this would create a large number of Collections. Is that ok?

  • The same applies to the Records Collection. Do you put all records, for all users, in the same Collection, or is it better to create one Collection per user or even one Collection per user per list?

  • We want users to be able to perform some data analysis on each list. Obvious examples might be computing the sum, min, max, count, etc., of a field. Searching and querying are equally important, and since the fields are user-defined, creating indexes seems less useful.

  • In relational database parlance, we would likely be doing a lot of “full table scans” computing such values or performing such queries. Does that impact the decisions above? (i.e.: Is there a cost difference between iterating all documents in a Collection that match a user_id vs just iterating all Documents in a Collection that is wholly owned by a user?)

  • Finally, we’re trying to understand pricing. We suspect the vast majority of our operations will be read-based. (Import once in bulk from CSV files, then organize and analyze the data.) If a user has a Collection with 1,000,000 Documents in it and they perform a search for “all documents where ‘x = y’”, does that equate to “1,000,000 read operations” in the pricing table? If that same user did 10 different searches in a day, did they just burn up 10,000,000 reads?

Thanks in advance for any insight you can offer. Fauna DB looks very interesting, but we’re a bit concerned about the schema and, in particular, the pricing model, so we’re hoping to get some clarification before moving forward.

Cheers

Hello @congers and welcome!! :wave:

A lot of questions here; I will try to take them one by one. If you have any follow-ups about implementing any of these aspects in greater detail, it may be helpful to move to a separate topic with a narrower focus.

Okay, let’s go!

Having a Collection per user, or a Collection for users’ lists could provide some convenience in that you wouldn’t have to specify the user and list as terms in searches.

That said, I wouldn’t necessarily recommend it. Your “terms” for user and list name just get moved to dynamically constructing Collection names (which drops type safety), and you would have to update your Roles every time a user is added.
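As a rough sketch of the contrast (the `byUserAndList` index and the collection-naming scheme here are hypothetical, not anything from your post):

```fql
// Dynamically constructed Collection names: the user id and list name
// are baked into an untyped string, so typos fail only at runtime.
Collection.byName("1234.groceries")!.all()

// A single Record Collection with an index instead: user and list
// become typed terms on a hypothetical `byUserAndList` index.
Record.byUserAndList(User.byId("1234"), "groceries")
```

With the second approach, access rules can also be expressed once against the Record Collection rather than per dynamically created Collection.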

FQL is quite powerful, so you can compute a variety of things in Fauna, including aggregations and such. However, on-the-fly analytics is not what Fauna is primarily designed for. If you need to perform a lot of ad-hoc analytical searches, then you might consider syncing Fauna with an external data warehouse that is optimized for that.

One way you can tackle aggregate stats is to run a background process and save a summary of the results. Ideally, you would only need to compute the difference between the most recent summary and the latest changes to the db. If writing to your database is an infrequent operation, then this may work well for you.
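For example, a background job might periodically recompute a stat and store it. This is only a sketch: the `Summary` Collection, the `amount` field, and the `byList` index are all hypothetical names for illustration.

```fql
// Recompute the total of a hypothetical "amount" field for one list
// and persist it to a hypothetical Summary Collection.
let list = List.byId("401")
let total = Record.byList(list).fold(0, (sum, r) => sum + r.amount)
Summary.create({ list: list, total: total, computedAt: Time.now() })
```

A fancier version would fold only over records changed since `computedAt` and add the delta to the stored total, rather than rescanning the whole list.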

This tutorial from the Fauna documentation describes a strategy for reducing database contention in high-write-throughput situations. It is based on FQL v4, but the concept still applies to the latest API.

Yes, there can be a significant difference in scanning an entire collection:

let me = Query.identity() // assume a User reference
Record.all().where(r => r.user == me) // full scan

vs. using an index:

let me = Query.identity()
Record.byUser(me) // only reads the index

.where filters results, whereas an index already efficiently stores subsets of documents by their terms. You are charged when you read documents, so you are still charged for reading each document even if the filter rejects it.

You should only be using .where after you have narrowed down the Set of documents as much as possible with an appropriate Index.
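For reference, an index like `byUser` above can be defined in the Collection’s schema using Fauna Schema Language, assuming each Record document stores a `user` field holding a User reference:

```fsl
collection Record {
  index byUser {
    terms [.user]
  }
}
```

Once defined, `Record.byUser(someUser)` returns only the documents whose `user` term matches, with no per-document filtering.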

Fauna does not employ a query planner. You are the query planner when you write FQL :slight_smile:

When you query a Collection or an Index, Fauna will continue to read until it can satisfy the page size. Default page size is 16, but you can set it to anything between 1 and 16000 (you’ll likely hit throughput limits on your account before successfully reading 16k anything).
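Setting the page size looks like this (a sketch; 100 is an arbitrary value between the 1 and 16000 bounds mentioned above):

```fql
// Request up to 100 results per page instead of the default 16.
Record.all().pageSize(100)
```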

So, for example, if you have 1M documents in the Record Collection, but only 1% of them match your user, then reading Record.all().where(r => r.user == User.byId("1234")) will have to read about 1,600 documents to provide you with 16 results, and would cost you about 1,600 Transactional Read Ops (TROs).

See our billing documentation for more details.

If you needed to read the entire set, rather than get just the first N results, then yes, it will cost 1M TROs to filter through 1M documents, and 10M TROs total to do that 10 times.

In the worst case, consider a scenario where there are zero matches for your filter. Fauna will continue reading the Collection, and charging you reads, until it reaches the end of the Collection and can report zero results. Up to 1M TROs for zero results!

Contrast that with using an Index. Record.byUser(User.byId("1234")) doesn’t have to filter anything, so reading a page of 16 documents only costs 17 TROs (1 for the index read and 16 for documents). Or only 1 TRO if there are zero matches :tada:

So, you should use indexes :smile:

Thanks so much for the detailed and thorough reply, much appreciated. I’m going to talk this over with the team and follow-up as necessary. Appreciate you taking the time to address so many points in one question.

In our use case, we could use an index on user_id to at least narrow the documents down to those owned by the currently logged-in user, but it looks like we’d still burn through a tremendous number of read operations whenever we query a field that is not indexed. (Which is likely the case for us, because the data we’re ingesting is unstructured; hence the appeal of something like Fauna.)

To expand on our use case: users might import CSV files from unrelated sources whose rows represent financial transactions. Queries might then be performed on that collection that need to look at every document (row).

ex: “How many documents have amount > $100”.

A lot of these queries would be ad-hoc, and there’s no realistic way to know in advance which fields our users might want to analyze. From the description above, it sounds like these queries would read every document owned by the user, and if that number were large, it would quickly translate into a large number of costly read operations.
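Concretely, the “amount > $100” example above might look like this (assuming a `byUser` index and an `amount` field on each record, both hypothetical):

```fql
// The index narrows the set to this user's documents, but because
// `amount` is not an index term, every one of those documents is
// read (and billed) before the filter and count are applied.
Record.byUser(Query.identity()).where(r => r.amount > 100).count()
```

So a user with 1M records would pay on the order of 1M read ops per such ad-hoc query, which matches the concern stated here.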

We’ll investigate further.