Fetching n-th percentile in a dataset

Hi all! :slight_smile:

I’m currently trying to calculate the nth percentile in one of my collections.
I’ve set up an index that returns all the numeric values in ascending order.

Then I want to select the n-th number inside that index.
Is that currently possible in a performant way?

I’ve included the way I currently do this: using Paginate with a specific size and fetching the max value of that page.

This seems brittle, since it will fail for collections where the n-th position lies beyond the 100,000-document pagination limit, and it looks like it will incur a lot of read operations this way.

Below is the way I’m currently doing this using the JavaScript driver.

import { query as q } from 'faunadb'

async function getPercentileForListOfDates(dates: string[]): Promise<IPoint[]> {
  const client = getClientForDatabase()

  // Returns the set of values for a given date
  const valuesOnDate = q.Match(q.Index("values_by_date"), q.Date(q.Var('x')))

  // Gets the 95th percentile for a given date, or null if no values are present.
  // Paginate fetches the first ceil(count * 0.95) values in ascending order;
  // Max reduces that page to its largest (last) value, which Select extracts.
  const percentileOnDate = q.If(q.IsEmpty(valuesOnDate), null,
    q.Select(['data', 0],
      q.Max(q.Paginate(valuesOnDate, {
        size: q.ToInteger(q.Ceil(q.Multiply(q.Count(valuesOnDate), 0.95)))
      })),
      null))

  // Do this for each of the given dates.
  const query = q.Map(dates, q.Lambda('x', percentileOnDate))

  const result = await client.query<number[]>(query)
  return result.map<IPoint>((value, index) => ({ time: dates[index], value }))
}

You can use Range() to filter the set before fetching it: https://docs.fauna.com/fauna/current/api/fql/functions/range
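For example, here is a minimal sketch assuming the values_by_date index from your snippet (term: the date, value: the numeric value) and an illustrative date; it fetches only the values between 100 and 500:

Paginate(Range(Match(Index("values_by_date"), Date("2020-11-15")), 100, 500))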

Hi Evan!
Thanks for the quick reply! :smiley:
I’ve looked into Range in the past, but doesn’t Range only filter on the actual values in the set, not on the position of a value?

Is it possible to use Range to fetch the value at position 95 in the set?

PS: Also, thanks for Fauna. I’ve been enjoying using it for a bunch of projects!

Hi @groenlid,

I’m not sure I understood what you want to achieve, but I’ll give it a try.
Let’s suppose we have documents like this:

Get(Ref(Collection("percentile"), "283071866323599873"))
{ ref: Ref(Collection("percentile"), "283071866323599873"),
  ts: 1606217218615000,
  data: { value: 94 } }

with value between 1 and 100.
I created an index like this:

CreateIndex({
  name: 'percentile_idx',
  source: Collection("percentile"),
  values: [{ field: ['data', 'value'] }, { field: ['ref'] }]
})

If your 95th percentile value is, let’s say, 91.5, you can issue this query:

Paginate(Match('percentile_idx'),{after:91.5})
{ before: [ 91.5 ],
  after:
   [ 97,
     Ref(Collection("percentile"), "283071659304288769"),
     Ref(Collection("percentile"), "283071659304288769") ],
  data:
   [ [ 92, Ref(Collection("percentile"), "283071576885166593") ],
     [ 92, Ref(Collection("percentile"), "283071597942669825") ],
     [ 92, Ref(Collection("percentile"), "283071635938869761") ],
     [ 92, Ref(Collection("percentile"), "283071640233837057") ],
     [ 92, Ref(Collection("percentile"), "283071644427092481") ],
     [ 92, Ref(Collection("percentile"), "283071648316260865") ],
     [ 92, Ref(Collection("percentile"), "283071649374274049") ],
     [ 92, Ref(Collection("percentile"), "283071696272884225") ],
     ......................................

and get all values exceeding 91.5.
Or

Paginate(Match('percentile_idx'),{before:91.5})
{ before:
   [ 86,
     Ref(Collection("percentile"), "283071664053289473"),
     Ref(Collection("percentile"), "283071664053289473") ],
  after: [ 91.5 ],
  data:
     ..................................   
     [ 91, Ref(Collection("percentile"), "283071707461190145") ],
     [ 91, Ref(Collection("percentile"), "283071715579265537") ],
     [ 91, Ref(Collection("percentile"), "283071758315028993") ],
     [ 91, Ref(Collection("percentile"), "283071768534450689") ],
     [ 91, Ref(Collection("percentile"), "283071777605681665") ],
     [ 91, Ref(Collection("percentile"), "283071809134264833") ],
     [ 91, Ref(Collection("percentile"), "283071818660577793") ],
     [ 91, Ref(Collection("percentile"), "283071827317621249") ],
     [ 91, Ref(Collection("percentile"), "283071833321767425") ] ] }

to get the opposite.
But as you mention, Paginate() has a hard limit of 100k documents per page. Over that limit, you have to iterate using the cursor.
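For example, a sketch of that iteration using the first query above: pass the after cursor returned with one page into the next call to fetch the following page.

// First page (up to 100k entries)
Paginate(Match('percentile_idx'), { size: 100000 })

// Next page: feed the returned 'after' cursor back in
Paginate(Match('percentile_idx'), { size: 100000, after: [ 97, Ref(Collection("percentile"), "283071659304288769"), Ref(Collection("percentile"), "283071659304288769") ] })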

Hope this answers your question.

Luigi

Hi Luigi! Thanks for the quick response!

This is not quite what I want to do either unfortunately.

I have a set of values that is ordered ascending like in your example above. I then want to fetch the value from position 10 inside that set when I don’t know what the values are.

[ 1, 4, 20, 50, 95, 1021, 1222, 1223, 1999, 2000, 2001, 2005 ]

Here I want to get the value 2001 from the set.

Regards,
Mats

Hi @groenlid,

Extracting the element at position 10 of an array is pretty easy:

 Select([10],[ 1, 4, 20, 50, 95, 1021, 1222, 1223, 1999, 2000, 2001, 2005 ])
2001

but I’m still unsure that’s what you are looking for… your goal is probably more complex.

Luigi

Hi @Luigi_Servini!

Your solution would work if I were working with an array type :slightly_smiling_face: In my query I’m working directly with the Set I get from the Match operator. Is there a good way to convert from a Set to an Array so I can use the Select operator? And does such a conversion incur a performance penalty?

My index looks like this:

And an example data entry looks like this:

{
  "ref": Ref(Collection("metrics"), "282253522559304199"),
  "ts": 1605436785240000,
  "data": {
    "value": 447.0350000192411,
    "id": "1605436784271-1753921565150",
    "href": "https://webvitals.groenlid.workers.dev/",
    "event_name": "FCP",
    "speed": "4g",
    "device": "desktop",
    "domain": "webvitals.groenlid.workers.dev",
    "pathname": "/",
    "time": Time("2020-11-15T10:39:45.060386Z"),
    "date": Date("2020-11-15"),
    "hour": Time("2020-11-15T10:00:00Z")
  }
}

Hi @groenlid,

quickly jumping in.

The problem
You are looking for a skip/take equivalent from SQL, as far as I can see. You are right that there is no easy way to do that out-of-the-box, and it’s an often-requested feature. If there is no feature request for it yet, you can open one in the forums for people to vote on. However, this is a trade-off. It’s not unthinkable that Fauna will support this one day, but keep in mind that many databases that do support skip/take end up doing a full scan if you skip a lot of items, and become really, really slow. In fact, I have first-hand experience with clients that crashed their own database (even scalable databases like Cassandra) just by querying for “Select * FROM X skip 1000000”. You can easily find an article about this; e.g. this was my first hit when I searched for it. In my experience, some analytic databases (e.g. ClickHouse) are good at that, but they do optimizations behind the scenes at the moment of ingestion and are typically not meant for OLTP (hence they don’t have the same lock problems, since they might not even offer transactions).

Fauna’s pagination
Fauna aims not to hand users any guns that can be used to shoot themselves in the foot, hence our pagination works in a scalable fashion and is more accurate than skip/take, since it uses a snapshot that makes sure your pages don’t suddenly change when new data arrives. Fauna’s pagination depends on the values ordering of your index, so you can mimic it with Range(). Depending on what values you then put in your index, you can get the random access you desire. However, this has the disadvantage that the problem is moved to… having these values in your data.
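To make that concrete, here is a sketch with the percentile_idx index from earlier: the before/after cursors in the examples above position you in the set by value, and Range() takes a slice by value in the same way.

// The slice of the ordered set with values from 92 through 97 (inclusive)
Paginate(Range(Match('percentile_idx'), 92, 97))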

Workaround
There is a possibility to work around this by keeping an incremental ID and using an index with Range. In that case, it would work more like the example solution in that Stack Overflow question. That in itself is, of course, not the perfect solution when it comes to distributed scalability, since you might have to write to the same document multiple times to maintain the counter, which creates contention. If you don’t risk having many writes at the same time, you can opt for that in case pagination is more important than scalability.
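Here is a minimal sketch of that workaround, using hypothetical names (a data.seq field you maintain as an incremental counter on write, and a values_by_seq index over it):

// Hypothetical index ordered by the maintained counter
CreateIndex({
  name: 'values_by_seq',
  source: Collection('metrics'),
  values: [{ field: ['data', 'seq'] }, { field: ['data', 'value'] }]
})

// Positional access: the entry whose seq is 95, i.e. position 95,
// without paginating past everything before it
Paginate(Range(Match(Index('values_by_seq')), 95, 95))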

As usual in computer science, everything is a trade-off. You can also work around that by having a rather approximate incremental ID, for which you can come up with some workarounds. I’m deliberately staying vague here, since I would need to write it out extensively, in almost a semi-blogpost, to make what I have in mind crystal clear.

In that solution, they speed their pagination up by using a sorted value (what I suggested as well, with the disclaimer of the disadvantages). You can do exactly the same in Fauna with Range(), as Evan suggested, and if you do, you are in some way mimicking what our pagination does. Of course, everything depends on what value you sort on. If you need exact access to a certain page, then you need a sorted value to access that page, and you essentially move the problem to having a value in your data that suits your needs.

Conclusion/Disclaimer
Just to be clear, this is my understanding; I don’t work on the core product, so it might be slightly off. It would make an interesting blog post, though. If anything in my answer confuses you, feel free to poke me :slight_smile:

The recurring theme when it comes to architecting your solution in Fauna is that Fauna:

  • Protects you from making a choice that is not scalable
  • Provides you with all the basic building blocks and options to do what you want if you want to pursue a specific solution anyway.

At this point we do not try to be clever and offer high-level features that might look useful but would not be scalable or performant behind the scenes. In many ways we give you raw access to the full performance of the database. That doesn’t mean we will never support it, though, or that there are no ways to support this efficiently, so feel free to open a feature request on the forum.

Hi @groenlid,

we don’t actually have a percentile function out-of-the-box, so I tried to put together some FQL to calculate and extract the element corresponding to the nth percentile:

Let(
  {
    percentile: 95,
    data: Paginate(Match('percentile_idx'), { size: 100000 }),
    count: Count(Match('percentile_idx')),
    // Position of the nth percentile within the ordered set
    percentile_idx: ToInteger(Multiply(Divide(ToDouble(Var('percentile')), ToDouble(100)), ToDouble(Var('count')))),
    // Each page entry is a [value, ref] pair; Select needs the page's 'data' key
    result: ToObject([[Concat([ToString(Var('percentile')), '_percentile'], ''), Select(['data', Var('percentile_idx')], Var('data'))]])
  },
  Var('result')
)

The percentile_idx binding just computes the position of that percentile within the ordered set. You can change the percentile value to return whichever percentile you want (e.g. 50 for the median).

The hard limit here is still pagination (100k documents maximum), but your index looks very selective and might return few enough documents to stay within that limit.

Let me know if we are on the right track or if I have to change something.

Luigi