A proposal to unlock more meaning from queries through index parameters

hunp · March 14, 2023, 2:52pm

We can use indexes to build intricate relevance scores via a binding.

And we can use JOIN to relate different indexes or collections together.

However, if you want to build a relevance score that involves two indexes, there is no efficient way to do so.

For example, in a music database:
MUSIC_GENRE_BY_USER_AND_POPULARITY is an index that records how many times a user likes a particular genre.

From this, we can use a binding to calculate a genrePopScore with a time factor to answer the question “which genres does someone like the most and also most recently?”

Next we have another index SONGS_BY_GENRE

Which tells us the most popular songs in a genre with a similar time based songScore.

However, there is no way to combine the two scores efficiently.

What I really want is the genrePopScore x songScore which answers the question “which popular songs would this user really like?”

One way this could happen would be allowing index bindings to accept parameters.

So:

CreateIndex({
 name: "SONGS_BY_GENRE",
 source: {
    collection: Collection("SONGS"),
    fields: {
      songScore: Query(
        Lambda("ref", "genrePopScore"
            Let({
                  songScoreCalc: //some algorithm, 
                }, 
                  Multiply(Var("songScoreCalc"),Var("genrePopScore"))
        )
      )
    },
  },
  terms: [ field: ["data", "genre"],
  values: [
    //sort on songScore
    { binding: "songScore" }, 
    { field: "ref" }
  ],
});

The key line, and only change from existing FQL being

 Lambda("ref", "genrePopScore"

where genrePopScore would be passed through the call to Match eg:

 Match(
        Index("SONGS_BY_GENRE"),
        Ref("someGenre"),
         0.83
      )

It would still match against Ref("someGenre") but the second parameter 0.83 would be available in the bindings (both values or terms).

This way you can Join a Match against MUSIC_GENRE_BY_USER_AND_POPULARITY to generate a genrePopScore and pass it through to the SONGS_BY_GENRE index to finally get your combined score sorted with the most relevant data at the top of the results.

This would be useful in any number of scenarios. Imagine a medical database where you want to create a score based on pharmaceuticals relevant to a particular person and cross reference it with information based on their genome to arrive at the most suitable remedy candidates.

Or…for a financial institution calculate a more relevant and accurate financial risk score tailored to each user.

And why stop at 2 collections? You can call Join as many times as you want to really create intricately cross-linked queries that answer all kinds of interesting relational questions.

As it currently stands, the kinds of Joins you can do in Fauna are somewhat narrow and limited and kind of black and white, and this improvement would open up a massive new way to extract and elucidate meaning from a database and provide access to all the shades of grey that exist, but currently cannot be revealed (efficiently, and at scale).

ptpaterson · March 15, 2023, 2:00pm

Hi @hunp! Index bindings are calculated once, at the time the document is written. Bindings have to be “pure” because side-effects from other documents cannot be accounted for.

I have to say that I really like this idea, but then I don’t think that Indexes are the correct mechanism for this. The purpose of an index is to provide efficient storage and reads by certain values for your Collection; delaying the computation of an index value would amount to leaving it unsorted until after the entire set is computed.

For example, even if you only want the top 3 values, you’ll still have to compute over the entire collection at the time of reading before you can sort and grab the top 3.

So then, what then is the right way to go about it, and what more can Fauna do to enable this use case?

Am I right to guess that MUSIC_GENRE_BY_USER_AND_POPULARITY indexes a different Collection than SONGS_BY_GENRE? Going off of that assumption…

the music genre has a score; and
the songs themselves have a score; and
you want to search for songs by a new “super” score that is the product of the song score and the score of it’s genre.

Because you are reaching across Collections, I don’t think bindings are appropriate for this use case. Changes to music genre can affect the super score for a song, so you need to account for those.

Here is another question: Do you need the search results to be available immediately or can your search be updated asynchronously?

Immediate/synchronous

If you need the search to be updated immediately, and keep efficient searches, then my advice is to:

Always perform updates to songs through a UDF. In this UDF,
a. read the related genre document and get the current score.
b. compute the songScoreCalc and store it in the song document.
c. compute the super score and store it in the song document.
Always perform updates to music genres in a UDF. In the UDF,
a. compute the genrePopScore and store it in the genre document.
b. update all related songs using the song-update UDF to apply the latest super score.
Create an index with the song’s super score as a value

Note: this relies on frequently updating many documents, and possibly some documents multiple times in a short period of time. This can lead to write contention and 409 errors. I strongly advise that you use a work flow that allows massively scalable reads and a single source of writes for applying “likes” and updates to the scores. I’ll touch on this more below.

eventually/asynchronous

If you can delay updating search indexes, then on some scheduled cadence (e.g. every 5 seconds or 1 minute, etc.), run a query that does the following:

Find all songs and genres that have changes in likes
Update the score for all of the genres
Update the scores for all of the songs

Create an index with the song’s super score as a value to efficiently sort and read the songs by the super score.

This helps you scale the writes and compute required to update the genres and those songs related to the genres.

Further scaling the app

It is not a good idea to have frequent updates to individual documents. Doing so can cause write contention and 409 errors.

We have a page in our documentation about applying the “Event-Sourcing” pattern to do things like aggregate statistics such as number of “views”, “likes”, etc.

The basic idea is that instead of writing to the song document immediately when a user “likes” it, you create an new event log that says “+1 like to song X”. On regular intervals, you gather up all of the latest events and apply the new aggregated values all at once.

Since there is only a single source of updates to your documents (the aggregation job that runs only once per interval), the writes are guaranteed to be conflict-free and thus highly scalable.

In the context of your music scores, you could also compute the new scores as part of this aggregation process.

What else can Fauna do?

I’ve focused a lot on how to work around your issue. And to be clear, the event-sourcing pattern is a very well-established workflow for any database and one, in particular, we have worked with a number of customers to implement.

That said, I don’t want to dismiss the core of your request, which I think can be summed up as

Provide an efficient way to sort on a computed field

This feature request is still a great place for additional discussion on that. We would love to hear feedback on the recommendations and any other ideas folks have.

hunp · March 15, 2023, 5:49pm

Correct. MUSIC_GENRE_BY_USER_AND_POPULARITY Indexes a different collection that tracks user’s likes of a genre.

Create an index with the song’s superScore as a value

I believe this may be an important oversight. A song does not have a single superScore. Each superScore of a song is specific to one user answering the question “which popular songs would this user really like?” Using this event sourcing technique, the only way I can think of doing this would be to have a new SUPER_SCORE collection where the superScore is recalculated for all songs for all users any time any of the songs is updated.

If there were 1 million songs and 1 million users, that would be 1 trillion docs all of which would need to be updated any time any of the songs are interacted with by any of the users. Doesn’t feel right to me. Even if I were to make a simplifying assumption and only update the top 250 songs in this model, that would still be 250 Million records needing updating any time a song is liked, not scalable. I’m sure I have not considered all possible efficient solutions, but, I also suspect this may be beyond fauna’s capabilities.

Topic		Replies	Views
Feedback wanted! Advanced dynamic query generation Random fql , indexes	2	129	December 20, 2024
Merge index data from before and after a join Help fql	3	845	July 28, 2020
Index sorting using relationship? Help fql , indexes	2	413	May 4, 2021
FQL Equivalent of SQL WHERE column_name IN (value1, value2, ...);" Help fql	3	388	March 17, 2021
Query performance issues, does not scale with documents Help fql	1	477	March 9, 2021

A proposal to unlock more meaning from queries through index parameters

Immediate/synchronous

eventually/asynchronous

Further scaling the app

What else can Fauna do?

Related topics