Index Operations Count

Hello,

I am building a SaaS product for which FaunaDB seems quite a good fit. However, it is extremely important for me at the moment to make sure that I understand the operation count for indexes in detail before I can proceed. I find that the documentation about indexes leaves a bit too much room for interpretation, but by combining it with some experimentation I arrived at the following conclusions, which I would like to have confirmed or corrected, along with some questions.

For simplicity, let’s make the following two assumptions:

  • all indexes I use have a non-empty list of terms
  • all indexes I use have exactly 2 values, one of which is the document’s Ref

Read operations
Since the terms list is not empty, every read only needs to look into one partition. Therefore, for each index, one readOp is counted for each 4 kB read from the index. However, such read operations don’t group up across indexes, so if I read 100 B from each of 40 indexes, I make 40 readOps. Is this correct?
Also, I would like to know if and how the terms have an influence on the number of readops per index. My guess so far is that the total terms list size is accounted for once per index read, while the values are of course counted for each result provided by the Match index operation. So, for example, if there are 50 results, then I should (roughly) account for a total readBytes of 50 × (avg. values size) + terms size for the Match index query. Is this correct? Or do terms have a bigger impact?

Write operations
For each create/update/delete of a document in a collection, only the indexes with matching terms are updated, so the other indexes do not incur write operations. Is this correct?
Suppose now that we are updating a document. Since the id is not going to change, under the above assumptions, for each index there is exactly one field in the index entry that can possibly change. My question is, are the fields that stay the same also rewritten (in the index entry) and counted toward bytesWrite, or only the ones that change?
Also, according to the documentation, index writes group up. So if I update a document with 100 B of data, and there are 20 indexes which each get updated with 45 B of data, then this counts in total as 1 writeOp.

Compute Operations
Compute operations (after the first one per query) are only counted when the index is created, and the total amount depends on the preexisting documents in the collection. Correct?

Hi @JamgoTech !

However, such read operations don’t group up across indexes, so if I read 100 B from each of 40 indexes, I make 40 readOps. Is this correct?

Yes.

Also, I would like to know if and how the terms have an influence on the number of readops per index.

Index terms affect read and write ops. Each term’s value has to be stored in an index entry, so they affect write ops directly. Read ops are also affected when/if the term storage crosses 4k page boundaries. For small terms tuples, multiple index entries fit into a 4k page. For large terms tuples, multiple pages might need to be read for a single index entry.

There’s no distinction between read ops for terms and values, since the database has to read index entries and an index entry can contain both terms and values.

If you are attempting to fine-tune read/write ops for index entries (a brief sketch follows this list):

  • Only index fields whose values have high cardinality. Low cardinality fields tend to incur more read ops than those with high cardinality.

  • Indexing fields with small values reduces read/write ops compared to fields with large values.

  • Index only the fields that you need.

  • Use the minimum number of indexes required to support your workflow.
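
For example, a minimal sketch of an index definition along those lines. The collection, field names, and index name here are hypothetical, and the snippet assumes the faunadb JavaScript driver with q and client configured as usual:

  // Index only a high-cardinality field as the term, and cover only the
  // small values that queries actually need.
  await client.query(
    q.CreateIndex({
      name: "orders_by_customer_email",        // hypothetical index name
      source: q.Collection("orders"),          // hypothetical collection
      terms: [{ field: ["data", "email"] }],   // high-cardinality term
      values: [
        { field: ["data", "status"] },         // small covered value
        { field: ["ref"] }                     // the document Ref
      ]
    })
  )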

For each create/update/delete of a document in a collection, only the indexes with matching terms are updated, so the other indexes do not incur write operations. Is this correct?

No. All indexes whose source is a Collection that contains a document being modified incur write operations.

My question is, are the fields that stay the same also rewritten (in the index entry) and counted toward bytesWrite, or only the ones that change?

Fauna stores history for each document. Indexes have to reflect that history, or the At() function would not return accurate results. So, even when the terms and values tuples for an index entry have not been modified, a new index entry is created sharing the timestamp of the document modification.
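
As a hypothetical illustration (the index name, term value, and timestamp are made up, and q/client refer to the faunadb JavaScript driver), that per-modification index history is what allows a historical read like the following to return accurate results:

  // Read the matching index entries as they existed at a past timestamp.
  await client.query(
    q.At(
      q.Time("2022-09-01T00:00:00Z"),   // some point in the past
      q.Paginate(q.Match(q.Index("sample_by_id"), "old-term-value"))
    )
  )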

Compute operations (after the first one per query) are only counted when the index is created, and the total amount depends on the preexisting documents in the collection. Correct?

No. Compute operations are counted per FQL function call in a query. The size of an index, or the number of documents in a collection, has no direct effect on the number of compute operations. If you have to issue multiple Paginate queries to investigate all of the matching documents in an index, then you’d incur compute ops for those function calls.
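
For example, a sketch with the JavaScript driver (the index name and term value are hypothetical): each Paginate call is a separate query with its own FQL function calls, so each page incurs its own compute ops, no matter how many documents the collection holds.

  // First page of matches for a term.
  const page1 = await client.query(
    q.Paginate(q.Match(q.Index("sample_by_id"), "some-term-value"), { size: 64 })
  )

  // Fetching the next page is a separate query with its own function calls,
  // so it incurs its own compute ops (assuming page1.after exists, i.e., there
  // are more results).
  const page2 = await client.query(
    q.Paginate(
      q.Match(q.Index("sample_by_id"), "some-term-value"),
      { size: 64, after: page1.after }
    )
  )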

I hope that helps!

Hi @ewan ,
Thanks for your answer. I’d like to check with an example whether I understand the writeOps count correctly. Suppose I have 15 indexes whose source is a given collection, that on average 100 B of document data are indexed per index, and that a document in the collection is on average 2 kB in size. Then, at each update, the expected (average) write load is 2 + 15 × 0.1 = 3.5 kB, thus 4 writeOps (assuming no other write happens in the same query). Do I understand this correctly? Or are there more writeOps in the example above?
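
To make my assumption explicit, this is the back-of-the-envelope calculation I have in mind (assuming writeOps are simply the total writeBytes rounded up to the next kB):

  // My (possibly naive) write-op estimate for one document update:
  const docBytes = 2 * 1024;                // average document size: 2 kB
  const indexedBytesPerIndex = 100;         // average indexed data per index
  const numIndexes = 15;
  const totalWriteBytes = docBytes + numIndexes * indexedBytesPerIndex;  // 3,548 B
  const writeOps = Math.ceil(totalWriteBytes / 1024);                    // 4
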
My biggest concern remains the readOps, though. It seems to me that the only situation in which one really has good control over the index readOps count is by using as terms only high-cardinality fields which we expect to partition the index fairly evenly, that is, with low variance. If a collection continuously increases in size, then an expected index readOps count should not exceed O(log n) results per index query, where n is the number of documents in the collection, to have something one can hope to control well. But this means that the term field(s) cardinality should be at least Ω(n / log n) and the variance should be at most O((log n)^2). Outside of this scenario, I see little hope of good control (unless of course the collection is of bounded size, a case which is easy to handle). I do see several use cases that can meet the requirements above under reasonable assumptions, so I don’t want to give the impression that I am completely negative about this. I just wanted to express the obstacles to predictability that I see (not only for costs, but also for performance, because the above discussion relates to performance in a similar way).

I would love to know whether my analysis is on the right track or whether I am misunderstanding how indexes work.

I’ve tried to repro your scenario as follows.

The sample document is a bit over 2K, with a 100-character id field, and 8 fields with 256 characters each. The field names also add to the overall size.

{
  ref: Ref(Collection("sample"), "342793282963636736"),
  ts: 1663172369210000,
  data: {
    id: '1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111',
    field1: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
    field2: 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
    field3: 'cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc',
    field4: 'dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd',
    field5: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
    field6: 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
    field7: 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC',
    field8: 'DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD'
  }
}

I created 100 indexes using some JavaScript:

  // Driver setup (assumed): the faunadb JavaScript driver with a secret
  // for the target database provided via an environment variable.
  const faunadb = require("faunadb");
  const q = faunadb.query;
  const client = new faunadb.Client({ secret: process.env.FAUNADB_SECRET });

  // Build 100 CreateIndex expressions, each indexing the same term (data.id).
  var arr = new Array(100).fill().map((e, i) =>
    q.CreateIndex({
      name: `sample${i}`,
      source: q.Collection("sample"),
      terms: [
        { field: ["data", "id"] }
      ],
    })
  )
  // Create all 100 indexes in a single transaction.
  await client.query(
    q.Do(arr)
  )
  .then((res) => res)
  .catch((err) => console.log(err))

Finally, here’s what an Update to that document costs:

> Update(
  Ref(Collection("sample"), "342793282963636736"),
  {
    data: {
      id: "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222"
    }
  }
)
Query metrics:     218 bytesIn,      2,421 bytesOut,       129 queryTime, 
                     0 readOps,         65 writeOps,         5 computeOps,
                 6,870 readBytes,   66,413 writeBytes,       0 retries     

then an expected index readOps count should not exceed O(log n) results per index query, where n is the number of documents in the collection,

If your index defines terms, the number of index reads corresponds to the number of matching entries, not the number of documents in the collection.

If you want to know what a query costs, and you’re using the JavaScript driver, you can execute the queryWithMetrics method to return both the query result and the metrics values. Doing so would remove all doubt about the billable consumption of any query.
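
For example, a minimal sketch with the JavaScript driver (the term value and the secret’s environment variable are placeholders of mine, and the exact shape of the returned metrics object may vary by driver version):

  const faunadb = require("faunadb");
  const q = faunadb.query;
  const client = new faunadb.Client({ secret: process.env.FAUNADB_SECRET });

  // queryWithMetrics returns both the query result and the billing metrics.
  const { value, metrics } = await client.queryWithMetrics(
    q.Paginate(q.Match(q.Index("sample0"), "some-term-value"))
  );
  console.log(value);    // the query result (a page of matching index entries)
  console.log(metrics);  // read/write/compute ops, bytes read and written, query time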

Hi @ewan ,

So, if I understand the metrics correctly, 2 function calls are used per index update, resulting in 2×100 + 3 = 203 function calls, hence 5 computeOps. When it comes to writes, the 65 is still unclear to me, though. It seems to amount to about 640 B written per index. How does that amount come about?

As for the part of my question about the readOps, I think you truncated the quote a bit too early, which changes the meaning of that part.

How does that amount come about?

Sorry for the delay in answering. I wasn’t sure myself, and I found that the documentation’s description couldn’t help in providing an answer (we’ll get that fixed).

After discussing with our database engineers, the answer of 65 comes from:

  • 3 write ops for the document since it is 2.4K in size.
  • The remaining 62 write ops are for index operations.

The index entry modification involves:

  • A removal of the old index entry (because the term changed)
  • An addition of a new index entry.

Indexes sort their documents in two ways: by the values definition and by the document modification timestamp. So there are two ordered sets to modify, each needing both the index entry removal and addition.

Beyond the size of the modified id field, index entries store the covered document’s reference and timestamp. Even though the changed data was 100 bytes, the reference and timestamp add 57 bytes of overhead, for 157 bytes per index entry.

157 bytes × 2 (remove + add) × 2 (once in the values-ordered set, once in the ts-ordered set) × 100 indexes = 62,800 bytes written. Since a write op is counted for each 1K (or part thereof), 62 write ops get counted.
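
As a quick sanity check, here is that arithmetic spelled out (assuming write ops are counted per started 1,024 bytes):

  const bytesPerEntry = 100 + 57;       // changed term data + ref/ts overhead
  const entriesPerIndex = 2 * 2;        // (remove + add) in both the values set and the ts set
  const numIndexes = 100;
  const indexWriteBytes = bytesPerEntry * entriesPerIndex * numIndexes;  // 62,800
  const indexWriteOps = Math.ceil(indexWriteBytes / 1024);               // 62
  // plus 3 write ops for the ~2.4K document itself = 65 total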

Ah okay, now it’s clear, thank you. I was assuming that the removal was not charged as a writeOp, so now it all makes sense.

Incidentally, I have another small suggestion for the documentation as well. I noticed that Fauna uses 1KB to mean 1024B, but that’s never mentioned on the website (if it is and I missed it, I’m sorry, just ignore this). Since KB is an ambiguous notation, I think it would be useful for Fauna users to have this specified somewhere (or to switch to an unambiguous unit like KiB).
