News feed - posts of all people a user follows (plus back-of-the-envelope calculations)

I’m about to write a simple news feed on Fauna.
It just lists all posts from the people I follow.

Ideas I had to solve it:

  1. Store references to the followers in each post document (fan-out on write).
  2. Store a follower-followee relation, get all followees for the user, then query an index documents_by_followees.
  3. Store the followees in one array field on the user profile, then query documents_by_followees with the array from that field.

My understanding is that for the relation approach (idea 2), if a person follows 1000 others, then the Union would require 1000 read ops per query.
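
For reference, this is roughly what that query could look like (a sketch; the index names followees_by_follower and posts_by_author are assumptions, not existing indexes). Each Match inside the Union is its own index read, which is where the ~1000 read ops come from:

Paginate(
  Union(
    Select(
      "data",
      Map(
        // One page of everyone the current user follows.
        Paginate(Match(Index("followees_by_follower"), Identity()), { size: 1000 }),
        // Turn each followee ref into the set of their posts.
        Lambda("followee", Match(Index("posts_by_author"), Var("followee")))
      )
    )
  )
)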

The fan-out-on-write version (the first idea) would require 1 read, but each document followed by 1000 people would have to store those refs inline. Assuming one ref is 30 bytes:
1000 * 30 bytes = 30 kilobytes per document

So which is the best cost-wise? What are other ways to approach it?

It seems that fan-out on write is roughly 44 times better financially:

1000 * 30 bytes * 30 * 100,000 (1000 followers * 30 bytes/ref * 30 posts * 100,000 such people) = 90 gigabytes = $15.30

100,000 * 300 * 15 (100,000 people with 1000 followers * 300 unique followers * 15 views per month) = 450,000,000 read ops = $223.50 per month.

Unfollowing
Assuming 100,000 people with 30 posts each would each be unfollowed by 1% of their 1000 followers: 30M writes per month, cost of writes = $57.00

This totals to around $600/month (with bandwidth included).
So it seems that fan-out on write is the way to go. Could somebody shed more light on that?

If you read more than you write, it stands to reason that simplifying reads with writes would pay dividends.


Following up on the previous posts, I have a first prototype that uses a followers array of Refs.
When an author posts new content, all of his followers are written into that specific content document.

The benefit of this setup is that it requires 1 read (get the followers array), 1 write (of the new content with its followers array), and it allows searching via an index on followers with just 1 read (per page).
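
For context, that index and query could look roughly like this (a sketch; the name posts_by_follower and the data layout are assumptions, not the actual prototype). Fauna unrolls the array, so there is one index entry per follower:

CreateIndex({
  name: "posts_by_follower",
  source: Collection("posts"),
  // Indexing an array field creates one entry per element.
  terms: [{ field: ["data", "followers"] }]
})

// One read op per page: all posts whose followers array contains the current user.
Paginate(Match(Index("posts_by_follower"), Identity()))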

The problem is adding too many followers. The docs state that the maximum request size is 1MB, so if there are too many followers I think Fauna would reject the request.

Let(
	{
		postref: Ref(Collection("posts"), "269334440698708486"),
		post: Get(Var("postref"))
	},
	Update(Var("postref"),
	{
		data:
		{
			// Append the new follower values (placeholders here) to the existing array.
			followers: Append([1, 2, 4],
				Select(["data", "followers"], Var("post"))
			)
		}
	})
)

Will that happen? Or does this kind of Append/Select somehow work in the background?
@ben

An array of followers isn’t going to scale to 1000s of followers, that’s for sure. If I were trying to develop such an application, I’d definitely model a follow as a separate document {follower, followee}. Then when someone tweets, I’d probably have a UDF that pulls a page of followers (say 1k), writes a ref of the tweet to all their timelines, and if there are any leftover followers, creates a task with a cursor to the next page and the ref of the tweet, and passes the task ref back to the calling process as the result of the query. The caller can then use that as a continuation to keep going until all the followers have the tweet.

Yes, this sounds complicated. It is. BUT it is resilient to failure, and scalable to any number of followers.
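
A rough sketch of that flow, simplified to return the page cursor directly instead of creating a separate task document. The collection names (follows, timelines), the index followers_by_followee, and the data.author field on tweets are assumptions, not an existing schema:

CreateFunction({
  name: "fanout_tweet",
  body: Query(
    Lambda(
      ["tweetRef", "afterCursor"],
      Let(
        {
          author: Select(["data", "author"], Get(Var("tweetRef"))),
          // One page (up to 1000) of the author's followers, starting at the
          // given cursor if one was passed in.
          page: If(
            IsNull(Var("afterCursor")),
            Paginate(Match(Index("followers_by_followee"), Var("author")), { size: 1000 }),
            Paginate(Match(Index("followers_by_followee"), Var("author")), { size: 1000, after: Var("afterCursor") })
          )
        },
        Do(
          // Write the tweet ref into each follower's timeline.
          Foreach(
            Select("data", Var("page")),
            Lambda("followerRef",
              Create(Collection("timelines"), {
                data: { owner: Var("followerRef"), tweet: Var("tweetRef") }
              })
            )
          ),
          // Return the cursor for the next page, or null when done.
          Select("after", Var("page"), null)
        )
      )
    )
  )
})

The caller starts with Call(Function("fanout_tweet"), [tweetRef, null]) and keeps calling again with the returned cursor until it gets null back.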


Hey @ambrt,

Request limit

We are talking about a request size limit, not a document limit, so your code will probably not fail unless the array you append is going to be too big, but I assume you will never add 1000s of followers at once? However, is it a good idea to do this? No! 🙂

When to use arrays, when to use a collection

Ben is right (of course he is 😄) that using an array to store links will not scale. Look at it from the perspective of the complexity of adding or removing an item in that array:

  • Adding an item to the array overwrites the whole array again.
  • What if you want to remove one? You’ll have to find the reference in that array, remove it, and store the whole array again.

If I’m not mistaken about how that works internally, this means that each time you are storing that whole array again, which is not efficient. If you create an index on this followers array, FaunaDB will unroll the array and you’ll have an index entry per follower (which is pretty cool). That also means that each time you update this array, the index has to update all of the impacted entries. It would be way more efficient to have a separate collection for that.
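
For contrast, with a separate collection each follow or unfollow is one small write, regardless of how many other followers exist. A sketch, assuming a hypothetical follows collection and an index follows_by_follower_and_followee with terms [follower, followee]:

// Follow: create one small document per relation.
Create(Collection("follows"), {
  data: { follower: Identity(), followee: Ref(Collection("users"), "123") }
})

// Unfollow: find that one document and delete it.
Delete(
  Select(
    ["ref"],
    Get(Match(Index("follows_by_follower_and_followee"),
              [Identity(), Ref(Collection("users"), "123")]))
  )
)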

In terms of guidelines what to use when:

  1. Use arrays when you are storing a few items and your data won’t change too often
  2. Use a separate collection when that data is going to change often and/or it’s going to become big.

Example

Did you check the Fwitter repository? The code has such an example: here you can see the collection ‘followerstats’ being created, which is essentially a link between two users.

follower <-> followerstats <-> author

It also has the advantage that you can store extra data there about the relation.
An initial article that introduces Fwitter can be found here.
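
Purely as an illustration (this is not the actual Fwitter schema), such a relation document can carry whatever per-relation data you need, for example to rank a feed later:

Create(Collection("followerstats"), {
  data: {
    follower: Ref(Collection("accounts"), "1"),
    author: Ref(Collection("accounts"), "2"),
    // Extra data about the relation itself.
    postLikes: 0,
    postReposts: 0
  }
})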


Thank you guys for the replies.

I didn’t know that arrays are only for small sets.
I think I will go the route that @ben described.

@databrecht I’ve read into the Fweets repo a bit more and there’s a wealth of knowledge about Fauna in there.

The way to think about arrays is that documents are limited in size, and arrays are scoped to documents. Even if we made array updates a little nicer in FQL (we probably should), it would still pin your total follower count to the size of the document.

So the array approach is not scalable.

For now I understand that there are two proposed ways:
write new content to each follower’s own timeline,
or
write into collections and fetch each post by ref (the Fweets repo approach).

The first approach is heavy on writes and light on reads.
The second is the opposite (light on writes and heavy on reads).

There might be another approach.
I’m not sure, but here it goes:

Let’s say that there is an index fweets_by_author.
Then a collection of {author, follower} documents with an index authors_by_follower.

Paginate(
  Join(
    // The set of authors the current user follows.
    Match(Index("authors_by_follower"), Identity()),
    // For each of those authors, their fweets.
    Index("fweets_by_author")
  )
)

One could add values to fweets_by_author so that the data (likes, comments, etc.) is returned from the index call itself.
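
A sketch of what that could look like: put the scalar fields into the index values so a single index page returns everything needed to render the feed (the field names here are assumptions):

CreateIndex({
  name: "fweets_by_author",
  source: Collection("fweets"),
  terms: [{ field: ["data", "author"] }],
  values: [
    // Inlined scalars: returned straight from the index, no Get per fweet.
    { field: ["data", "message"] },
    { field: ["data", "likes"] },
    { field: ["ref"] }
  ]
})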

This way the write is done just once, and reading takes only one read op per page (according to the web shell, for the query above).

The Fweets repo gets each fweet with
Get(Var('ref'))
inside
GetFweetsWithUsersMapGetGeneric.

Fetching 30 fweets therefore costs at least 30 read ops.

So what do you think? Would it work this way, or does it have some major flaw?

Inlining scalars into indexes is a smart optimization move! Your queries will cost you less on the read side and evaluate faster to boot. You do an extra write per index added, but that often scales better than complex read expressions. I can’t comment on the overall approach because I don’t have the time to spare to think too hard about it, but it sounds initially like a decent approach to me.
