Is it possible to do dynamic filtering (Filter, Range etc) without the use of Get on all the objects?

Lets say I have a statistics collection and the documents can have different keys, such as the following:

{
	car: "car1",
	type: "1",
	scores: {
		mean: 5.0,
		stddev: 1.0
	}
}
{
	car: "car1",
	type: "2",
	scores: {
		total: 15.0,
		min: 0.5
	}
}

I understand that you can do something like

Range(Match(Index("car_stats_by_type"), "2"), 0, 100)

To get a set of Documents that fit that criteria, which may be searching on the value 'scores', 'total'. From my understanding of read ops and due to the use of the index, this will only read the documents that fit the criteria in Range, making it efficient.

My problem is that I am trying to do a similar thing however the value that I filter from Range will be dynamic (I have lots of keys within my objects, with the key names changing depending upon external factors). For example, I may want to return all Car id values where total for stats type 2 is above a certain value, and the mean for stats type 1 is below a certain value. I have been thinking of two ways to achieve my solution:

  1. Get all documents, do Select and filter that way
  2. Create an index for each possible key I want to filter on (may be a lot of indexes)

The first approach means that my read ops will be really high, and the second approach means that there will be a lot of indexes and may cause issues on writing new documents. I also saw that you can create an index where the value is a query, to ‘dynamically’ create a value and use that as the index value to sort on instead. Is this possible to adapt to my use case in order to leverage the power of indexes (and therefore Range, Intersection and such)?

From my understanding of read ops and due to the use of the index, this will only read the documents that fit the criteria in Range, making it efficient.

Range, by itself, finds a subset within the provided set of values that fit within the start (inclusive) and end (exclusive) points. The set, at that point, is still abstract/hypothetical. To materialize the set, Fauna requires you to use the Paginate function.

Materializing an index into a page of results based on the index’s entries does not, in fact, read the associated documents. Read ops do occur to fetch the matching index entries but the documents themselves are not read. So index-only queries can potentially provide a huge savings in read ops, especially for large average document sizes, and including only small fields within the index.

With that background covered, on to your goal. Index construction is key. If your index is defined like so:

CreateIndex({
  name: "car_stats_by_type",
  source: Collection("cars"), 
  terms: [
    {field: ["data", "type"]}
  ],
  values: [
    {field: ["data", "scores", "total"]}
  ]
})

Then your Range query should succeed, but the only value returned for matching entries is the value in the total. You can’t fetch documents based on some field value; you need the document’s reference.

However, if you include the document’s reference, like so:

CreateIndex({
  name: "car_stats_by_type_for_total",
  source: Collection("cars"), 
  terms: [
    {field: ["data", "type"]}
  ],
  values: [
    {field: ["data", "scores", "total"]},
    {field: ["ref"]}
  ]
})

Then Range will return both the total value and the document’s reference.

That reference can help you combine the two conditions you require. You don’t just want the totals and the means, you want the documents that have those properties, either the total AND the mean, or perhaps the total OR the mean. Since Fauna’s indexes provide sets, you can use the set functions Union (for OR), Intersection (for AND), or Difference (for NOT).

So, based on the criteria in your example, the first group of documents to gather is the set of type 2 cars where the total is above a certain value. That’s this clause:

Range(Match(Index("car_stats_by_type"), "2"), 15, [])

I chose 15 as the “certain value”. The empty array means “to the end of the set”, so you get all totals from 15 and up. Note that documents without a total field are not included in the set.

The second set of documents are cars of type 1 with a mean smaller than a certain value. That’s this clause:

Range(Match(Index("car_status_by_type_for_mean"), "1"), [], 7.5)

Here, I chose 7.5 as the “certain value”. Since the empty array is in the Range call’s start position, this time it means “from the beginning of the set”, so you get all means up to (but not including) 7.5.

That would require an index like this:

CreateIndex({
  name: "car_stats_by_type_for_mean",
  source: Collection("cars"), 
  terms: [
    {field: ["data", "type"]}
  ],
  values: [
    {field: ["data", "scores", "mean"]},
    {field: ["ref"]}
  ]
})

If we want to verify that clause 2 is working (based on the sample documents that you provide):

> Paginate(
  Range(Match(Index("car_stats_by_type_for_mean"), "1"), [], 7.5)
)
{ data: [ [ 5, Ref(Collection("cars"), "330787989428896290") ] ] }

Now, to combine the clauses into a single set. How we do that depends on the data. Since your document structure can vary, there may be documents (as you described) that have total fields but not mean fields, those documents can only possibly exist in the set returned for clause 1 OR clause 2, but can never be in both sets. That rules out using Intersection, unless you specifically care about documents that have both. Otherwise, you’d have to Union the sets, to get documents with total OR mean fields.

Finally, as you can see in the test query, the results are tuples that include the total or mean value. We can’t simply Union or Intersection these sets (regardless of goal), because the numerical values add an additional, unwanted, axis of comparison. When we combine the sets, we want only the document references.

To do that, we need to employ Join and Singleton:

Paginate(
  Join(
    Range(Match(Index("car_stats_by_type_for_mean"), "1"), [], 7.5),
    Lambda(
      ["number", "ref"],
      Singleton(Var("ref"))
    )
  )
)

This query repeats the Range clause of the test query, but includes Join which matches up the results from the Range with a Lambda function that forms a single-value set based on the document reference, effectively ignoring the numerical value.

If we do that with both clauses, our query would look like:

Paginate(
  Union(
    Join(
      Range(Match(Index("car_stats_by_type_for_total"), "2"), 15, []),
      Lambda(
        ["number", "ref"],
        Singleton(Var("ref"))
      )
    ),
    Join(
      Range(Match(Index("car_stats_by_type_for_mean"), "1"), [], 7.5),
      Lambda(
        ["number", "ref"],
        Singleton(Var("ref"))
      )
    )
  )
)

The result is:

{
  data: [
    Ref(Collection("cars"), "330787947282432546"),
    Ref(Collection("cars"), "330787989428896290")
  ]
}

That’s both of the documents that you provided, and they both match the first clause OR the second clause, because of Union. Had we used Intersection, the result would be empty because no document has a total and mean.

Since the resulting set only contains document references, we can confirm that what these documents contain with:

> Map(
  Paginate(
    Union(
      Join(
        Range(Match(Index("car_stats_by_type_for_total"), "2"), 15, []),
        Lambda(
          ["number", "ref"],
          Singleton(Var("ref"))
        )
      ),
      Join(
        Range(Match(Index("car_stats_by_type_for_mean"), "1"), [], 7.5),
        Lambda(
          ["number", "ref"],
          Singleton(Var("ref"))
        )
      )
    )
  ),
  Lambda("ref", Get(Var("ref")))
)
{
  data: [
    {
      ref: Ref(Collection("cars"), "330787947282432546"),
      ts: 1651722819570000,
      data: { car: 'car1', type: '2', scores: { total: 15, min: 0.5 } }
    },
    {
      ref: Ref(Collection("cars"), "330787989428896290"),
      ts: 1651722859760000,
      data: { car: 'car1', type: '1', scores: { mean: 5, stddev: 1 } }
    }
  ]
}

After you get this working, then you can work out how the two subordinate Range clauses should be combined with the actual data you have in your collection.

2 Likes

Thank you very much Ewan, that is very helpful. I guess the next part I will have to figure out, is how to implement the indexes. In that example I gave, there are only a few variables such as total and mean and a few types of stats, but in production we will have a much larger amount, probably 100 different combinations in total. I imagine having this many indexes on a single collection would be infeasible, so would there be some way to make the index dynamic in a sense, or could the index itself return the entire scores object as the value, and then the specific key I’m after (such as total, but which change depending upon the type) be obtained using Select, and somehow using that within Range? I did read through the documentation and see some notes around indexes returning a non-sortable type, so I’m not sure if its possible or not.

I imagine having this many indexes on a single collection would be infeasible

Lots of indexes do “work”, but as the number of indexes grows, so do the write ops required to maintain them. At some point, a very large number of indexes could impact query performance, but I haven’t explored those boundaries. 100-ish indexes shouldn’t be a problem, performance-wise.

so would there be some way to make the index dynamic in a sense

The only dynamic aspect of an index is a binding, which can be used to compute a value for storage in the index entry when a document is created or updated. There is no dynamic aspect to indexes when matching with them.

Indexes can only store scalar values, not objects or arrays. You can define the values field to specify every scalar field within a document, which creates a large tuple for results. You’d have to deal with every value in the result tuple, and the read op advantage would decrease with every additional field added.

Since Range operates on a set, and there’s no way to build up a set without involving an index, that approach won’t work.

I don’t know enough about your intended workflow to make many further suggestions. However, you might consider using the event-sourcing pattern if your workflow could be broken down into stages that compute intermediate results.

Also note that, as a shared service, Fauna eschews the SQL database approach of “do what you can to satisfy this query, no matter how long it takes”. That means you might have to break your workflow into batches, especially if the number of documents considered at any one time is large. Range (and the other set functions) can time out when processing very large sets.

Thank you for that. Just one final question if you don’t mind, can a binding in an index compute a value that is not just a scalar value, and can it be dynamic depending upon the data in the document? I have seen one example where someone computed an array in the binding of an index (see: Transaction exceeded limit of 16MB with 56kb file - #11 by kdilla301 ). In my case, if I can make the return value an object, that may suit my workflow, however that may not be possible.

Bindings can compute a result based on the data in the document: that’s exactly what they are for. Binding execution happens when a document is created or modified, not when the index itself is used in a Match.

Since index entries only store scalar values, a binding must return a scalar value.

That said, if the binding returns an array, the behavior is the same as when the index defines a field containing an array: one index entry per array item is created. This is a feature that makes it easier to find documents based on array items.

For example, if an index’s values defines a field containing an array, such as a tags field, and a document’s tags field contains ["A", "B", "C"], three index entries are created, one each for the elements in the array. When you subsequently use Match(Index("<the index name>"), "A"), you’ll get an entry in the result set.

So, a binding can return an array, and you’ll get one index entry per element in the bindings return value.

If you are indexing bindings or fields that provide array, the number of index entries created is equivalent to the Cartesian product of the number of array items. So, be careful: your indexes could “explode” with writes due to relatively small changes in computed arrays. Two arrays with 10 items each results in 100 index entries. If that changes to 20 items each, that’s 400 index entries.

I just wanted to ask, because we use Range directly on the Index that contains tuples and not a single number, how does Fauna know which variable within the tuple to Range on? Does it just assume there is one number in the tuple, and do it on that one?

If your index’s values definition specifies three fields, Range scans the set of three-value tuples. For each value, it compares the start and end values that you provide against the tuples.

If you specify start or end as A, only the first value in the tuples is compared. If you specify ["A", "B"], then A is compared against the first value in each tuple, and B is compared against the second value in each tuple.

The set of tuples provided by an index is always sorted, so scanning to find the start and end locations is fast.

Thanks Ewan, you’re the best.

1 Like

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.