Counting distinct values within (time) range

Hi,

my collection contains of data with a value and a timestamp. The same value can appear multiple times (with different timestamps).

My current indexes are:

Index 1

{
  name: "by_time",
  unique: false,
  serialized: true,
  source: "data",
  values: [
    {
      field: ["data", "time"]
    },
    {
      field: ["ref"]
    }
  ]
}

Index 2

{
  name: "by_value",
  unique: false,
  serialized: true,
  source: "data",
  values: [
    {
      field: ["data", "value"]
    }
  ]
}

I have a query counting all distinct values by using index 2.
And I have a query counting all values within a time range using index 1.

What I am struggling with is counting all distinct values within a selected time range.
I tried to use intersect and join but could not figure it out.

Add an index, value_by_ref and join with it.

{
  name: "value_by_ref",
  source: "data",
  terms: [
    {
      field: ["ref"]
    }
  ],
  values: [
    {
      field: ["data", "value"]
    }
  ]
}

Then filter the time with Range and Join. The value_by_ref Index should already reduce the results to be unique.

Count(
  Join(
    Range(
      Match(Index('by_time')),
      TimeSubtract(Now(), 10, "day"),
      Now(),
    ),
    q.Lambda(
      ['time', 'ref'],
      q.Match(q.Index('value_by_ref'), q.Var('ref')),
    )
  )
)

Here is another way without using a Join.

Query: Get the count of distinct employeeID for all orders with orderdate between (‘1998-05-01’, ‘1998-05-31’)

Count(
  Distinct(
    Map(
      Map(
        Select(
          ['data'],
          Paginate(
            Range(
              Match(Index('all_orders_orderdate_asc')),
              ToTime('1998-05-01T00:00:00.000Z'),
              ToTime('1998-05-31T00:00:00.000Z')
            ), {
              size: 100000
            }
          )
        ),
        Lambda(
          'x',
          Select([1], Var('x')))
      ),
      Lambda(
        'x',
        Select(['data', 'employeeID'], Get(Var('x'))))
    )
  )
)

Index Definition

Get(Index('all_orders_orderdate_asc'))
{
  ref: Index("all_orders_orderdate_asc"),
  ts: 1608657782276000,
  active: true,
  serialized: true,
  name: "all_orders_orderdate_asc",
  source: Collection("orders"),
  values: [
    {
      field: ["data", "orderDate"]
    },
    {
      field: "ref"
    }
  ],
  partitions: 8
}

Sample Order Data from NorthWind dataset.

  {
    ref: Ref(Collection("orders"), "285555555991307776"),
    ts: 1608589145100000,
    data: {
      orderID: 11076,
      customerID: "BONAP",
      employeeID: 4,
      orderDate: Time("1998-05-06T00:00:00Z"),
      requiredDate: Time("1998-06-03T00:00:00Z"),
      shipVia: 2,
      freight: 38.28,
      shipName: "Bon app'",
      shipAddress: {
        street: "12 rue des Bouchers",
        city: "Marseille",
        region: "NULL",
        postalCode: 13008,
        country: "France"
      },
      details: [
        {
          productID: 14,
          unitPrice: 23.25,
          quantity: 20,
          discount: 0.25
        },
        {
          productID: 19,
          unitPrice: 9.2,
          quantity: 10,
          discount: 0.25
        },
        {
          productID: 6,
          unitPrice: 25,
          quantity: 20,
          discount: 0.25
        }
      ]
    }
  }

Since the “Map Map” example relies on Get it costs O(N) Read Ops. The Join method seems like always exactly 1 Read Op.

More looking into the cost, I entered in 100 items with Now() as time and values between 1 and 10.

For “Join” method:
image

For “Map Map Get” method:

I don’t know if byteReadOps effect billing. They scale with both methods, but are still much higher with “Map Map Get”.

Thanks, @ptpaterson. byteReadOps is the new ReadOps.
I was sharing an example to show how multiple FQL queries can achieve the same results.

Users should always check for queryTime, byteReadOps, byteWriteOps, computeOps metrics as they indicate Performance, Usage and Billing.

Thank you both for your input. Each has its own benefits.

@ptpaterson
I’ve seen similar solutions using an extra index on other topics and could follow the provided solution.
In this case I don’t quite understand how/why the following works:

Maybe you could shed some light on this.

My current understanding is:
If I filter the following data using Range to only list this years entries by using the by_time index

{ value: 400, time: 2020-11-05... , ref: 100 }
{ value: 300, time: 2021-01-08... , ref: 101 }
{ value: 200, time: 2021-01-08... , ref: 102 }
{ value: 300, time: 2021-02-22... , ref: 103 }

then I would end up with something like

{ time: 2021-01-08... , ref: 101 } // value 300
{ time: 2021-01-08... , ref: 102 } // value 200
{ time: 2021-02-22... , ref: 103 } // value 300

Now comes the part I don’t understand:
If the value_by_ref index holds all the refs of the dataset and I feed the Lambda with the times and refs above, how is one representing a value 300 filtered out?

Shouldn’t every ref find a match in the value_by_ref index?

It seems like I am missing something really basic here…

I honestly don’t know. I was expecting to wrap the result in Distinct before Count but only realized that was unnecessary as I was building the query and saw the results.

The docs for Join don’t go into depth, or at least imply that your/our assumptions should be correct.

This I must defer back to @Jay-Fauna . Which by the way…

Thank you! This makes sense – I was certain new billing should be higher and that explains it all. The billing page in docs does not contain the word “byteReadOps”, and I skimmed over the differently-named header, x-byte-read-ops.