Social media/Fwitter user view checker extension. Index on document exists/Indexing empty documents. Speeding up queries which check if a relationship exists

I have a data structure modelled on the Fwitter example, with three important collections: fweets, users, and fweetstats. Most importantly, each fweetstats document contains a ref to a user and a ref to a fweet.

I need to build a fast index, fweets_by_rank, that excludes posts a user has already seen.

How do I specify the fweets_by_rank index so that it can:

  1. Take in a user ref as an input.
  2. Then return only fweets which do not have a fweetstat document which references the supplied user ref and the found fweet doc.

I have tried using Difference() at runtime to compare fweets_by_rank and fweetstats_by_user and it works, but this is a suboptimal query that will take a long time once the dataset grows.

Hello @Hunter_K!

Definitely! Difference is great for small, bounded Sets. It has to fully read each of the input Sets to execute, so it can indeed be problematic for Sets that keep growing as you scale. So let’s dig into other ways we might accomplish your search.

First, a quick note: the Fwitter example suffers from a scalability problem that I want to discuss, because I think reviewing it will also add perspective to your question. The issue is with how it manages stats on the Fweet: each action a user makes updates the fweet and fweetstats documents. This is fine when there is only an occasional impression on a single fweet. However, once you have spikes of even a handful of requests trying to modify the same document, you will run into aborted transactions due to contention.

The typical solution is not easily covered at the same time as demonstrating all of the other sophisticated things you can do with Indexes in Fauna, which the fwitter example focuses on. I am referring to the “Event Sourcing Pattern” which we describe in detail here:

The TL;DR is that you cannot frequently write to a single document from multiple sources. Instead, the various sources should write to a kind of log Collection, while a background process reads the logs and acts as the single writer of the aggregated values.
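To make the shape of that pattern concrete, here is a minimal in-memory sketch in Python. It is not Fauna code and does not use any Fauna API: `ViewEvent`, `ViewLog`, and `aggregate` are hypothetical names, the list stands in for a log Collection, and the dictionaries stand in for the aggregated documents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewEvent:
    """One 'user saw fweet' record (hypothetical shape)."""
    user_id: str
    fweet_id: str


class ViewLog:
    """Append-only log: many writers add events, none update shared documents."""

    def __init__(self):
        self._events = []
        self._cursor = 0  # index of the first event not yet aggregated

    def append(self, event):
        self._events.append(event)

    def drain(self):
        """Return the events not yet processed and advance the cursor."""
        batch = self._events[self._cursor:]
        self._cursor = len(self._events)
        return batch


def aggregate(log, view_counts, seen_by_user):
    """Single writer: fold pending events into the aggregated values."""
    for event in log.drain():
        seen = seen_by_user.setdefault(event.user_id, set())
        if event.fweet_id not in seen:  # count each user/fweet pair once
            seen.add(event.fweet_id)
            view_counts[event.fweet_id] += 1
```

In this sketch, request handlers would only ever call `log.append(...)`, and one background job would call `aggregate(...)` on a schedule, so the hot aggregate values have exactly one writer.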

So what does this have to do with filtering tweets already seen by a user? I have a few thoughts:

  • Doing pure Index operations (bindings plus things like Difference) on unbounded data is not possible in a scalable, efficient way. It is better to narrow your search as much as possible and then do some simple/fast computation on those results.
  • You can use the event sourcing pattern to update a single document that caches a list of things to filter (already viewed, blocked users, etc.). Then you can use that list to filter results more efficiently than fetching the list with an index every time.
  • Use some cheap monotonic metric (e.g. an always-increasing number) that tells you the user has not seen a fweet, or at least makes it more likely. For example, if you only want to show fweets with a rank higher than the most recent fweet in the feed, you can pre-filter the new fweets (and those in the deny/filter-list) by rank first to narrow down the list of candidates to show.
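The last two ideas can be combined. Below is a rough Python sketch, again in plain application code and not the Fauna API: `Fweet`, `unseen_fweets`, and the `min_rank_exclusive` watermark are hypothetical names. It assumes the candidates arrive already sorted by rank in descending order (as a rank-sorted index read might return them) and that `seen_ids` is the cached filter list for the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fweet:
    id: str
    rank: int


def unseen_fweets(fweets_by_rank_desc, seen_ids, min_rank_exclusive, limit):
    """Return up to `limit` fweets ranked above the watermark and not yet seen.

    Assumes `fweets_by_rank_desc` is sorted by rank, highest first.
    """
    results = []
    for fweet in fweets_by_rank_desc:
        if len(results) >= limit:
            break
        if fweet.rank <= min_rank_exclusive:
            break  # sorted input: everything after this is at or below the watermark
        if fweet.id in seen_ids:
            continue  # cheap set lookup against the cached filter list
        results.append(fweet)
    return results
```

The rank check narrows the search before any per-item filtering happens, so the work is bounded by the number of candidates above the watermark and not by the total number of fweets or views.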

That’s not a concrete solution for you, but I hope it helps guide you in the right direction!

