Query data field, statistics

Fyriif · February 9, 2021, 11:10am

Hi,

I am trying to generate some statistics from match data (JSON:https://api.opendota.com/api/matches/5132512783)

I have a table of these matches and I am trying to find out for example which is the most picked hero or most banned which would need to be counted from using hero_id and picks_bans:
picks_bans": [
{
“is_pick”: false,
“hero_id”: 17,
“team”: 1,
“order”: 0,
“ord”: 0,
“match_id”: 5212816885
},
{
“is_pick”: false,
“hero_id”: 69,
“team”: 0,
“order”: 1,
“ord”: 1,
“match_id”: 5212816885
},

as well as check top stats such as most kills which would need to be extracted and counted from:
“benchmarks”: {
“gold_per_min”: {
“raw”: 390,
“pct”: 0.19154557463672392
},
“xp_per_min”: {
“raw”: 468,
“pct”: 0.12681638044914134
},
“kills_per_min”: {
“raw”: 0.05689900426742532,
“pct”: 0.08982826948480846
}

these are all part of the match JSON object and I am wondering if Fauna is even a good choice for these kind of applications?

Any help is greatly appreciated.

databrecht · February 16, 2021, 1:06pm

You probably want to tell us what kind of document we are looking at.
Are these properties on some kind of ‘player’ document?

databrecht · February 16, 2021, 2:34pm

I can already say this though.

Aggregations

There are limits in place due to the distributed/scalable nature. Since Fauna is an OLTP database and not an OLAP, these kinds of aggregations are (currently) not its strength. You can write it, but you’ll bump into pagination or computation limits probably when your data size grows. More about how to do that can be found in related posts such as:

SQL group by, having counterpart in FaunaDB - #2 by databrecht
Would you use Fauna for analytics? - #2 by Jay-Fauna
Or how you could bump into the computation limit and the workaround:
Can't count collections with more than 600 000 documents - #2 by databrecht

In your case, for your aggregation you would probably (I didn’t think this through so it might not be the best way) start with all hero ids. Map over that and then per hero get all heroes that were picked by using an index on is_pick and hero_id as terms and count that. That way you will get all counts within a page and could sort externally… or could opt to continue paginating if the result of 100 000 max per page is not enough (you have more players) and then sort afterwards.

Alternative approach 1 - precalculate

Instead of aggregating on the fly, you could store aggregations. For example you could:

Opt to store a document per hero and increase the count each time a hero is picked. This is a tradeoff since if many users would write to that document at the same time, that’s a bad idea. There are techniques to reduce the amount of writes to the same document, e.g you could have 10 documents and randomly pick one.
Opt to store a count of picks in the document you provided above, which if I’m correct that this is a player, means you have a picked hero count per player. You could also store that in a separate document though that links back to player and hero. This is an intermediate solution, writes are separated over multiple documents (to which you write anyway) yet you have to count less compared to counting over all picks_bans in te array.

Alternative approach 2 - stream to external system

Stream data into an engine that is good in aggregations (RockSet / Clickhouse), you could use temporality to get changesets and stream it into such a system.

Fyriif · February 16, 2021, 3:10pm

I don’t think I will be running in any limits, I am already filtering the matches on league_id (matches belonging to tournament) and then further to filter them on date/time + region they were played in. A tournament will at most have 500 matches and I am only looking to generate statistics once, after tournament concluded and store the report in a separate table in Fauna to use in a web-app.

The pick_bans is a property of the match document which contains all possible information about the match.

I was looking at: data aggregation and hoping to get similar to:

picks_bans: [
    { hero_id: '17', total_picks_bans: 10,  total_picks: 4,  total_bans: 6},
     { hero_id: '54', total_picks_bans: 5,  total_picks: 2,  total_bans: 3},
     { hero_id: '2', total_picks_bans: 22,  total_picks: 13,  total_bans: 9}
  ]

but I am not sure if my way of thinking is the best where I have an array variable of picks_bans and unique_pick_bans and map and filter for pick/ban?

databrecht · February 16, 2021, 3:19pm

I would avoid arrays personally, you are going to update that document for each user that selects a hero. That means, many writes to one document, all these writes have to put in a certain order (serialized) which will slow down writes. I would at the very east make a hero_statistics collection or something and store these separately per hero instead of using an array.

Fyriif · February 22, 2021, 5:58pm

When I made an index with terms is_pick and hero_id, I think due to the picks_bans array being nested in the match document I get data duplication

For example if I create an Index “picks_bans_by_match_id” that returns hero_id and query it with match_id 5212816885 and is_pick false then I would get all hero_ids wether they are is_pick true OR false.

{
  name: "picks_bans_by_match_id",
  unique: false,
  serialized: true,
  source: "Matches",
  terms: [
    {
      field: ["data", "match_id"]
    },
    {
      field: ["data", "picks_bans", "is_pick"]
    }
  ],
  values: [
    {
      field: ["data", "picks_bans", "hero_id"]
    }
  ]
}

Would you happen to have a suggestion to fix that - am I creating indexes wrong?

Fyriif · February 22, 2021, 9:45pm

Alternatively I have tried to create an Index with Bindings where I create a binding of filtered picks_bans where is_pick= true and then use that array binding picks as a term:

  name: "match_id_picks",
  source: {
    collection: Collection("Matches"),
    fields: {
      picks: Query(
        Lambda(
          "matchDoc",
          Filter(Select(["data","picks_bans"], Var("matchDoc")), Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), true)))
        )
      )
    }
  },
  terms: [
    {
      field: ["data", "picks","hero_id"]
    }
  ],
  values: [
    { binding: "picks"}
  ]
})

Unfortunately, I am not getting any data returned for some reason?

databrecht · February 23, 2021, 2:25pm

Hey @Fyriif , since your questions are very domain specific, try to provide the full definition each time and try to make sure that we can just run the queries to get the data you have.

CreateCollection(...)
CreateIndex(...)
Create(Collection(..),  < your data> )

And test whether it gets created correctly. That will make it much faster for us to respond to your questions. The answer to this particular question is that your index is incorrect, you have written “data” “picks” “hero_id” while according to your data model it needs to be “data” “picks_bans” “hero_id”.

Try to reduce your tests to simpler cases in case you are stuck. For example, try to test first with:

CreateIndex({
  name: "match_id_picks2",
  source: {
    collection: Collection("test")
  },
  terms: [
    {
      field: ["data", "picks", "hero_id"]
    }
  ]
}

and you would have seen that this one already doesn’t work and that it’s not necessarily related to your binding. Once you fix that, separate out your binding and test it separately for a given document e.g.

Filter(
  Select(
    ["data", "picks_bans"],
    Get(Ref(Collection("test"), "291325376922124801"))
  ),
  Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), true))
)

Clearly your binding is correct.

[
  {
    is_pick: true,
    hero_id: 30,
    team: 0,
    order: 1,
    ord: 1,
    match_id: 5050331668
  }
]

Fyriif · February 23, 2021, 2:36pm

Hi,

I think I explained my problem wrong so I do apologize:

In my collection I have for example two documents:

{
  match_id: 5132545461,
  picks_bans: [
    {
      is_pick: true,
      hero_id: 105,
      team: 0,
      order: 0,
      ord: 0,
      match_id: 5132545461
    },
    {
      is_pick: false,
      hero_id: 66,
      team: 1,
      order: 1,
      ord: 1,
      match_id: 5132545461
    },
  ],
}

and

{
  match_id: 5132545462,
  picks_bans: [
    {
      is_pick: true,
      hero_id: 115,
      team: 0,
      order: 0,
      ord: 0,
      match_id: 5132545461
    },
    {
      is_pick: true,
      hero_id: 66,
      team: 1,
      order: 1,
      ord: 1,
      match_id: 5132545461
    },
  ],
}

So I create an index that should return document that has int hero_id and bool is_pick i.e hero_id 66, is_pick true which should return 1 of the documents.

{
  name: "hero_id_is_picked",
  unique: false,
  serialized: true,
  source: "Matches",
  terms: [
    {
      field: ["data", "picks_bans", "hero_id"]
    },
    {
      field: ["data", "picks_bans", "is_pick"]
    }
  ]
}

but the index would return two of the documents, ignoring the boolean value.

One of the solutions I tried to do was to create a binding index to filter the picks_bans to create array of just picks or bans (depending on index) like so:

CreateIndex({
  name: "picks_hero_id",
  source: {
    collection: Collection("Matches"),
    fields: {
      picks: Query(
        Lambda(
          "matchDoc",
          Filter(Select(["data","picks_bans"], Var("matchDoc")), Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), true)))
        )
      )
    }
  },
  terms: [
    { binding: ["picks","hero_id"]}
  ],
  values: [
    { binding: "picks"}
  ]
})

But this returns no data matches.

Ultimately my goal is to filter match documents on other criteria such as league_id, start_time and cluster and then see how many of certain hero_id picks or bans there are.

Fyriif · February 23, 2021, 2:40pm

Also my current solution is to do the follows:

Let(
  {
    matches: Map(
      Paginate(Match(Index("matches_by_leagueid"), 11517)),
      Lambda("ref", Get(Var("ref")))
    ),
    pbs_data: Reduce(Lambda(["accumulator", "value"], Append(Var("value"), Var("accumulator"))), Array(), Map(Var("matches"),Lambda("match", Select(["data", "picks_bans"], Var("match"))))),
    heroes_picks: Map(Filter(Select(['data', 0], Var("pbs_data")), Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), false))), Lambda("pick", Select("hero_id", Var("pick")))),
    heroes_bans: Map(Filter(Select(['data', 0], Var("pbs_data")), Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), false))), Lambda("pick", Select("hero_id", Var("pick")))),
    unique_heroes: Distinct(Union(Var("heroes_picks"),Var("heroes_bans"))),
    pbs: Map(
      Var("unique_heroes"),
      Lambda(
        "hero",
        Let(
          {
            p_heroes: Filter(Var("heroes_picks"), Lambda("heroid", Equals(Var("hero"), Var("heroid")))),
            b_heroes: Filter(Var("heroes_bans"), Lambda("heroid", Equals(Var("hero"), Var("heroid")))),
          },
          {
            hero: Var("hero"),
            total_picks: Count(Var("p_heroes")),
            total_bans: Count(Var("b_heroes")),
            popularity: Add(Count(Var("p_heroes")),Count(Var("b_heroes"))),
          }
        )
      )
    ),
  },
  {
    radiant_wins: Select(["data",0],Count(Filter(Var("wins"), Lambda("win", Equals(Var("win"), true))))),
    dire_wins: Select(["data",0],Count(Filter(Var("wins"), Lambda("win", Equals(Var("win"), false))))),
    heroes_picks_bans_popularity: Var("pbs")
  }
)

where I get all the matches for league_id and use Map function to filter and thus count hero_id occurrences in either the heroes_picked or heroes_banned arrays but of course I get quite close to the 12000 compute ops limit

databrecht · February 23, 2021, 4:07pm

That’s how indexes combine multiple values from arrays. If you want to get a clear view of that you could throw them in values instead of terms and see hwo these values are combined. That’s definitely not what you are looking for since it generates combinations for all possibilities.

Your other solution with

seems a bit strange to me and I can’t execute such an index in the shell.

Isn’t what you are looking for rather:

CreateIndex({
  name: "picks_hero_id4",
  source: {
    collection: Collection("test"),
    fields: {
      hero_picked: Query(
        Lambda(
          "matchDoc",
          Map(
            Filter(
              Select(["data", "picks_bans"], Var("matchDoc")),
              Lambda(
                "pick_ban",
                Equals(Select("is_pick", Var("pick_ban")), true)
              )
            ),
            Lambda("x", Select(["hero_id"], Var("x")))
          )
        )
      )
    }
  },
  terms: [{ binding: "hero_picked" }],
  values: [{ field: "ref" }]
})

Which works (I think)

If you do need to combine values from multiple arrays.

If you need to combine values (but I don’t think you do), you can workaround this by combining the values. For example the following statement would combine them:

Map(
  Filter(
    Select(
      ["data", "picks_bans"],
      Get(Ref(Collection("test"), "291329370225639941"))
    ),
    Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), true))
  ),
  Lambda('doc',Concat([
    ToString(Select(['is_pick'], Var('doc'))),
    "___",
    ToString(Select(['hero_id'], Var('doc')))
  ]))
)

Which you could then throw in an index binding:

CreateIndex({
  name: "picks_hero_id",
  source: {
    collection: Collection("test"),
    fields: {
      picks: Query(
        Lambda(
          "matchDoc",
          Map(
            Filter(
              Select(
                ["data", "picks_bans"],
                Var('matchDoc')
              ),
              Lambda("pick_ban", Equals(Select("is_pick", Var("pick_ban")), true))
            ),
            Lambda('doc',Concat([
              ToString(Select(['is_pick'], Var('doc'))),
              "___",
              ToString(Select(['hero_id'], Var('doc')))
            ]))
          )
        )
      )
    }
  },
  terms: [
    { binding: "picks"}
  ]
})

However, you don’t need those true values right you have already filtered them out in the binding? For example, throwing it in ‘values’ instead of terms we now see have an index only on the true values:

But in case you do need to combine vlaues, the above solution could work and you could then search for true__105 in the terms to receive all references that picked the hero.

I hope I’ve been of help, I might not be able to help further though since this is really starting to go deep into a specific application domain.

Fyriif · February 23, 2021, 4:23pm

Hi,

Thank you for the help, the index binding did the exact thing I was looking for.

Just one more question, the index binding, how am I able to Join for an index with league_id? So that I can count the hero picks in matches that belong to a certain league_id?

Fyriif · February 23, 2021, 4:30pm

would it be via:

Count(Intersection(
  Match(Index("picks_hero_id_final"),86),
  Match(Index("matches_by_leagueid"),11280)
  ))

databrecht · February 23, 2021, 4:51pm

You will need t o provide us with much more details on the different document structures involved to answer that question.

Fyriif · February 23, 2021, 6:24pm

the match file example from earlier also has the following attributes

“leagueid”: 11280,
“start_time”: 1569981543,
“cluster”: 236,

How would I go by only getting matches that have inputted leagueid, range of start_time and range of cluster?

I tried to do

Let(
 {
   matches:Intersection(
     Match(Index("matches_by_leagueid"),11517),
     Range(Match(Index("matches_by_start_time")),1574971461,1575232551),
     Range(Match(Index("matches_by_cluster")),131,138),
   )
 },
 {
   counted_m: Count(Var("matches")),
 }
)

but I am not getting matches even if I know they exist? Am I meant to be using Join?

Fyriif · February 23, 2021, 6:32pm

my indexes are as follows:

cluster:

{
  name: "matches_by_cluster",
  unique: false,
  serialized: true,
  source: "Matches",
  terms: [
    {
      field: ["data", "cluster"]
    }
  ]
}

leagueid:

{
  name: "matches_by_leagueid",
  unique: false,
  serialized: true,
  source: "Matches",
  terms: [
    {
      field: ["data", "leagueid"]
    }
  ]
}

start_time:

{
  name: "matches_by_start_time",
  unique: false,
  serialized: true,
  source: "Matches",
  terms: [
    {
      field: ["data", "start_time"]
    }
  ]
}

databrecht · February 23, 2021, 7:15pm

That depends on your index values, this is explained in this stackoverflow topic: How to query by multiple conditions in faunadb? - Stack Overflow

Your indexes are not correct, range indexes only work on values, not on terms. Once you use values, you will notice that intersections won’t work as you worte it there, since you can only intersect on things that return the same values. Therefore, you will need join to get only the reference out of there. Not ideal, we are looking into better ways for future versions of FQL. This is explained in that stackoverflow question.

Question from me: are you building a game that we will be able to play?

Fyriif · February 23, 2021, 10:30pm

Hi,

Not building a game, I am making a web-app that displays stats from game called DotA2.

So I was able to get the join/intersection working by following the link, my only question with it is why doesn’t it work on non-range match?

Range(Match(Index("by_leagueid")),11280,11280),
works but
Match(Index("by_leagueid"),11280,11280),
doesn’t

Also with my previous question that you’ve answered about the index bind:

CreateIndex({
  name: "picks_hero_id_final",
  source: {
    collection: Collection("Matches"),
    fields: {
      hero_picked: Query(
        Lambda(
          "matchDoc",
          Map(
            Filter(
              Select(["data", "picks_bans"], Var("matchDoc")),
              Lambda(
                "pick_ban",
                Equals(Select("is_pick", Var("pick_ban")), true)
              )
            ),
            Lambda("x", Select(["hero_id"], Var("x")))
          )
        )
      )
    }
  },
  terms: [{ binding: "hero_picked" }],
  values: [{ field: "ref" }]
})

how would this be adapted to be intersected with:

matches: Intersection(
      Join(
        Range(Match(Index("by_leagueid")),11280,11280),
        Lambda(['value', 'ref'], Match(Index('ref_by_ref'), Var('ref'))
      )),
      Join(
        Range(Match(Index("by_cluster")),231,236),
        Lambda(['value', 'ref'], Match(Index('ref_by_ref'), Var('ref'))
      ))
    ),

I tried the following but get data [0] even though I should have had multiple values returned:

Let(
  {
    matches: Intersection(
      Join(
        Range(Match(Index("by_leagueid")),11280,11281),
        Lambda(['value', 'ref'], Match(Index('ref_by_ref'), Var('ref'))
      )),
      Join(
        Range(Match(Index("by_cluster")),231,236),
        Lambda(['value', 'ref'], Match(Index('ref_by_ref'), Var('ref'))
      ))
    ),
    match_docs: Map(
      Paginate(Var("matches")),
      Lambda("ref", Get(Var("ref")))
    ),
    
  },
  {
    hero_count:Count(Intersection(
      Join(
        Range(Match(Index("picks_hero_id_final")),86,86),
        Lambda(['value', 'ref'], Match(Index('ref_by_ref'), Var('ref'))
      )),
      Var("matches"),
  
    )),
  }
)

databrecht · February 24, 2021, 5:16pm

Match works on terms
Range works on values
I would read through these tutorials Getting started with FQL, Fauna’s native query language - part 1

There is a great tutorial on indexes in there

Sorry but I currently can’t jump deeper into the code. I tried providing you with all the necessary details but can’t further debug code else I won’t get other work done. I hope you understand.

Fyriif · February 24, 2021, 5:39pm

Thank you,

No worries I realize that you’ve been more than helpful!

Topic		Replies	Views
SQL translation into FQL Query Help	4	453	October 21, 2020
Bulk inserts Help	5	2279	July 27, 2020
FaunaDB FQL - How to create JSON object and store it? Help fql	2	798	May 4, 2021
UDF to get an aggregated average split by multiple categories across nested documents Help graphql , fql , udfs	1	614	March 5, 2021
SQL group by, having counterpart in FaunaDB Help	3	2654	May 4, 2021

Query data field, statistics

Aggregations

Alternative approach 1 - precalculate

Alternative approach 2 - stream to external system

If you do need to combine values from multiple arrays.

Related topics