Performance issue with sorting using Join

Hello,

I’m running into a performance issue with my Fauna database and so far I haven’t been able to come up with a solution.

I have a collection “names” holding around 30,000 records.

My goal is to fetch 10 names in random order.

In order to create some randomness, I’ve assigned a random number to each record and created an index on this attribute:

{
  name: "names_random",
  serialized: true,
  source: "names",
  terms: [
    { field: ["ref"] }
  ],
  values: [
    { field: ["data", "sort"] },
    { field: ["ref"] }
  ]
}
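
For context, each record gets its sort value when it is created. A minimal sketch with the JS driver (the name field and value are just for illustration):

q.Create(q.Collection("names"), {
  data: {
    name: "Alice",        // illustrative document data
    sort: Math.random()   // random number the names_random index sorts on
  }
})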

My query is as follows:

Select(
    'data',
    Paginate(
        Join(
            Match(Index("all_names")),
            Index("names_random")
        ),
        {size: 10}
    )
)

Although I get the result I want, the query time is around 15s, which is well above the benchmark I have for this specific functionality (5-10s max).

Does anyone know how to rewrite this piece of code and/or restructure documents / indexes in order to get the performance I want?

Many thanks in advance!

You have the all_names index first in the Join, so it’s going to hit all 30,000 records THEN sort by the random number.

Can you try

{
  name: "names_random_first",
  serialized: true,
  source: "names",
  // no terms
  values: [
    {
      field: ["data", "sort"]
    },
    {
      field: ["ref"]
    }
  ]
}

Since it’s only 10 records, and not too complicated, just use Map rather than Join. Remember that Join is not like a SQL join but more like an imperative operation over an index.

Map(
  Paginate(Match(Index('names_random_first')), { size: 10 }),
  (sort, ref) => ref
)
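
If you prefer the explicit form (the JS driver converts the arrow function into a Lambda for you), the equivalent should be:

Map(
  Paginate(Match(Index('names_random_first')), { size: 10 }),
  // the index emits [sort, ref] tuples; bind both and return the ref
  Lambda(['sort', 'ref'], Var('ref'))
)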

Thanks @ptpaterson for your suggestion!

This would be a perfectly good approach. However, I’d simplified the problem a little bit for this posting. In fact, what goes into the Join first is not just the all_names index but a selection of names, excluding names the user has already swiped on, custom names from other users, and names that don’t match user-specific filter criteria:

q.Intersection(
    q.Difference(
        q.Match(q.Index("all_names")), //include all names, except:
        q.Match(q.Index("names_by_top"), false), //1. all none-top names:
        q.Match(q.Index('names_by_user'), q.Var('userRef')), //2. names that have already been swiped on by the user
        q.Match(q.Index("names_by_userdeck"), q.Var('userRef')), //3. names that were in the last user deck
        q.Difference(
            q.Match(q.Index("names_by_custom"), true), //exclude all custom names, except:
            q.Match(q.Index("custom_names_by_user"), q.Var('userRef')) //1. custom names from user
        ),
    ),
    ...filterQueries //these contain custom user filters, for example gender, origin, etc
)

Would there be a way to make your ordering approach work without adapting the “values” attribute of all these other indexes to include the data.sort field?

Oh! That makes sense. The order you have then makes sense: step 1, filter all the possible values; step 2, sort by the random number.

I’m guessing that the all_names index is still probably contributing to the bottleneck. Here are some ideas, hopefully applicable to the larger query. They all work on the principle of combining the smallest sets with the fewest operations possible.

tip: try to reduce the size of the first set in the Difference.

For example, the first two arguments of your Difference (all names, minus the non-top names) would be equivalent to:

q.Match(q.Index("names_by_top"), true)

Depending on the ratio of true to false in the “top” field, this might be really helpful. Or it might only sort out like 10 records lol :man_shrugging:
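
For reference, that assumes names_by_top is defined with a term on the boolean flag; the exact field path here is a guess:

{
  name: "names_by_top",
  source: "names",
  terms: [
    { field: ["data", "top"] } // assumed path of the "top" flag
  ]
}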

If there are any filters subsequent to this that are always run and filter out a significant number of documents, perhaps they can be pushed to the top of the Difference.

tip: For that matter, if there are multiple filters that are ALWAYS applied, put them all in a single index.

If it’s possible to avoid an Intersection by using a single index, do it.

For example, if gender and origin are ALWAYS applied as filters, then put a single combined index at the top of the diff:

q.Intersection(
    q.Difference(
        // start with all always-on filters
        q.Match(q.Index("names_by_base_filters"), true, Var('gender'), Var('origin'), /* and others?*/),
        q.Match(q.Index('names_by_user'), q.Var('userRef')), 
        q.Match(q.Index("names_by_userdeck"), q.Var('userRef')),
        q.Difference(
            q.Match(q.Index("names_by_custom"), true), //exclude all custom names, except:
            q.Match(q.Index("custom_names_by_user"), q.Var('userRef')) //1. custom names from user
        ),
    ),
    /* ... other filter queries */
)
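
A hypothetical definition for that combined index could look like this (the field paths are assumptions):

{
  name: "names_by_base_filters",
  source: "names",
  terms: [
    { field: ["data", "top"] },
    { field: ["data", "gender"] },
    { field: ["data", "origin"] }
  ]
}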

tip: order the Difference arguments such that they will filter out the most documents sooner.

It’s my intuition that this will work, but I’ve not verified it. I’d say it’s at least worth testing. I don’t know whether it will have much of an impact on your query though, based on what you’ve shared.

tip: use conditional logic to reduce Intersection and Difference complexity.

Again, I don’t know if it’s applicable.

But if you can remove a single Intersection or Difference argument by checking some condition, then maybe you can optimize the query.

q.If(
  q.Equals(q.Select(['data', 'something'], q.Var('userDocument')), 'some_value'),
  'Intersect_THIS', // placeholder for the set to use when the condition holds
  'Intersect_THAT'  // placeholder for the set to use otherwise
)

This can help avoid working over “all” Sets.

Since Set operations like Intersection, Difference, Union, etc. typically also work on arrays, you can do some interesting things to build an array of SetRefs conditionally and then pass that to the Set operation.
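
For example (a sketch: the hasOriginFilter field and the names_by_origin index are hypothetical):

q.Let(
  {
    // the always-on base set from the earlier example
    baseSet: q.Match(q.Index("names_by_base_filters"), true, q.Var('gender'), q.Var('origin')),
    // build the array of sets conditionally
    sets: q.If(
      q.Select(['data', 'hasOriginFilter'], q.Var('userDocument')),
      [q.Var('baseSet'), q.Match(q.Index("names_by_origin"), q.Var('origin'))],
      [q.Var('baseSet')]
    )
  },
  // pass the array of SetRefs to the set operation
  q.Intersection(q.Var('sets'))
)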

tip: denormalize / duplicate data.

This one makes me kinda sad, because FQL is great for working out complex queries without duplicating data. But if you REALLY REALLY need to optimize reads more, and writing isn’t so bad, then maybe there’s an option here.

Side note: all_names is possibly a redundant index, since Documents is available. Of course, if you have different values, then by all means keep your index. But if different values aren’t necessary, creating an extra default index could be costing you.
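
That is, instead of maintaining an index over the whole collection:

// replaces q.Match(q.Index("all_names"))
q.Documents(q.Collection("names"))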