Index on sub documents

Hello !

I wonder how I can populate an index with subdocuments (or nested embedded documents).
If I remember well, in couchdb, you can take a document, write a code that yields key-values exactly how you want.

Here, I can do something similar with bindings, but a binding is defined in the “source” of the document, and is run on the entire document. So it seems to me that it is not possible to have any link between several bindings that are used in terms and/or values.

I tried to illustrate this :

Take those two documents in a receipt collection :

{
  "ref": Ref(Collection("receipts"), "333175650266382400"),
  "ts": 1654006929960000,
  "data": {
    "cashier": "Josh",
    "articles": [
      {
        "type": "dress",
        "price": 22,
        "description": "A red dress"
      },
      {
        "type": "shoes",
        "price": 32,
        "description": "A pair of basketball shoes"
      },
      {
        "type": "dress",
        "price": 40,
        "description": "A more expensive dress"
      }
    ]
  }
}

{
  "ref": Ref(Collection("receipts"), "333175709185867840"),
  "ts": 1653999966710000,
  "data": {
    "cashier": "Rebecca",
    "articles": [
      {
        "type": "dress",
        "price": 25,
        "description": "A blue dress"
      },
      {
        "type": "belt",
        "price": 21,
        "description": "A lether belt"
      }
    ]
  }
}

I would like to do index the articles themselves, not the receipts.
For instance, I do not manage to do an index whose terms is the article type, and the values the article description (the use case would be to list all the descriptions for all dresses, among all receipts).

Of course this would be easier by just doing two seperate collections, one for receipts, one for articles. But the beauty of document oriented database would be to store your information and be able to implement new use cases later. And even further : it could matter for your use case that a receipt is atomic.

This is not a real use case; I tried to come up with an example that is easy to understand.

Here is something I tried, but it makes a cartesian product (as the doc says it does, so I’m not surprised).

{
  name: "test",
  serialized: true,
  source: "receipts",
  terms: [
    {
      binding: "article_types"
    }
  ],
  values: [
    {
      binding: "article_descriptions"
    }
  ]
}

This allows filtering on the keys I want, but returns too much values, since I’m applying my value binding on all the document, and not just the sub-documents I’m interrested into.

Is there another way of doing it ?

Would it not be cool to also be able to define an index with a lambda that returns a list of (terms, values) tuples ??

Fauna indexes only support scalar values, such as numbers, strings, references. Objects are not supported. Arrays are supported, but you get one index entry per array item. If you attempt to index two or more arrays, the number of index entries is the Cartesian product of all entries in the arrays.

To search for “complex” data, you’ll need to flatten the sub-documents into an indexable value.

To retrieve the sub-document, depending on how you flatten the sub-documents, it might be possible to recompose the original object structure. However, it would likely be easier to simply fetch the document and access the sub-document from there.

Or, as you mentioned, using separate collections would be better than attempting to manipulate sub-documents to conform with the index limitations.

Yes, thank you. What do you think about the feature I proposed ?
From my understanding, this should not break the way Fauna handles indexes, just add a new nice way of defining them instead of bindings.

This would add the possibility to define a logic between terms and values, and non cartesian-product use cases. While still being operated on a single document, with pure lambdas, and so would not change the fact that an index is indeed a materlized view.

Would love to see this happen. Fauna ease of use depends massively on indexes.

What do you think about the feature I proposed?

I don’t think I understand what it is that you are asking for. Perhaps you can elaborate.

Let me provide some detail about how indexes work, based on the points you made, and then you can suggest how your idea might improve the situation.

You can already fetch an index’s terms or values definitions:

> {
  terms: Select("terms", Get(Index("people_sort_by_letter_asc"))),
  values: Select("values", Get(Index("people_sort_by_letter_asc"))),
}
{
  terms: [ { field: [ 'ref' ] } ],
  values: [ { field: [ 'data', 'letter' ] }, { field: [ 'ref' ] } ]
}

just add a new nice way of defining them instead of bindings

A binding is a Lambda, I’m not seeing the difference that you’re thinking of.

should not break the way Fauna handles indexes,

Any change to the way indexes currently work would break index handling for someone.

Bindings execute when index entries are created or updated, not when they are scanned for matches. To do the latter would be a significant performance issue.

When using an index, there are two “phases”:

  1. Matching, where the tuple provided to Match is compared against index entries’ terms.
  2. Results, where the values tuples are returned for matching index entries.

At no point does Match execute an index’s bindings. Even if it did, the only values available are those stored in the index entry, the terms and the values. The original document is not available, and the Cartesian product side-effect of indexing array fields would remain.

Hello (again) Ewan, and thank you.

I think I did not explained my suggestion right. I understood the way indexes work (or at least the big picture as you explained it to me). I still think my idea or something similar could solve some issues.

I created a really small dataset, and a python script to reproduce it here : GitHub - MaximeWeyl/fauna_example

It contains houses (3 of them right now) :

{
  "ref": Ref(Collection("houses"), "333844715967348800"),
  "ts": 1654637981290000,
  "data": {
    "street": "Time Square",
    "city": "New York",
    "rooms": [
      {
        "name": "Kitchen",
        "area": 20,
        "wall_color": "white"
      },
      {
        "name": "Living Room",
        "area": 40,
        "wall_color": "yellow"
      },
      {
        "name": "Bedroom",
        "area": 30,
        "wall_color": "red"
      }
    ]
  }
}

  "ref": Ref(Collection("houses"), "333844716155043904"),
  "ts": 1654637981470000,
  "data": {
    "street": "Rue de la paix",
    "city": "Paris",
    "rooms": [
      {
        "name": "Kitchen",
        "area": 10,
        "wall_color": "white"
      },
      {
        "name": "Living Room",
        "area": 20,
        "wall_color": "yellow"
      },
      {
        "name": "Bedroom",
        "area": 10,
        "wall_color": "red"
      },
      {
        "name": "Bedroom",
        "area": 10,
        "wall_color": "green"
      }
    ]
  }
}
{
  "ref": Ref(Collection("houses"), "333844716312330304"),
  "ts": 1654637981620000,
  "data": {
    "street": "Alexanderplatz",
    "city": "Berlin",
    "rooms": [
      {
        "name": "Kitchen",
        "area": 30,
        "wall_color": "white"
      },
      {
        "name": "Living Room",
        "area": 40,
        "wall_color": "yellow"
      },
      {
        "name": "Bedroom",
        "area": 10,
        "wall_color": "orange"
      },
      {
        "name": "Bedroom",
        "area": 10,
        "wall_color": "green"
      },
      {
        "name": "Bedroom",
        "area": 12,
        "wall_color": "white"
      }
    ]
  }
}

Initial app

Let say this model was implemented some time ago, and the app made the clients happy.
They are able to answer questions like (indexes exist in my github repo to create the example):

:question: Show me the houses in city X

"terms": [{"field": ["data", "city"]}]

:question: Give me the (sorted) list of cities where houses are located

"values": [{"field": ["data", "city"]}]

:question: Give me the (sorted) list of cities that have houses with at least a room of color X

"values": [{"field": ["data", "city"]}],
"terms": [{"field": ["data", "rooms", "wall_color"]}

:question: Give me the (sorted) list of cities that have houses with at least a room of color X

"values": [{"field": ["data", "city"]}],
"terms": [{"field": ["data", "rooms", "wall_color"]}

How an index is defined

An index definition is a function f(document) → list< (terms, values) >. The current definition let the user define a g(document) → ( list< terms > , list< values > ) but there is a cartesian product cp( ( list< terms >, list< values > ) → list< ( terms, values ) > that is chained in the background, and really what’s stored is a list< (terms, values) > (or something similar). You define g, knowing that cp will be chained to get result f.

As you said, this f function is applied when a document is created or updated, not at Match time. This is why it takes only the document as input and must be pure (no other reads, no writes, no side effects).

Using bindings or not does not change the signature of g, this is just a way of writing it that allows to define g that does not just access properties by name. Non bindings definitions (“fields”) are just some shortcut for a binding that would be a simple “SelectAll” function applied.

New enhancement wanted from customers

This is all great, the app is in production. The customers use it more and more, and they begin to have new questions. Most of them we can just build the appropriate index, and answer them, by defining the right function g, that will be chained with cp, to make the f function.

But for some questions, I cannot define the right g function.

For instance :
:question: Give me the list of colors that are used for rooms of type X (for instance X=Kitchen)

For this, you just want an index that would store the following (this is some notation I made up for showing what the index stores, you should understand it quite easily) :

[
  {
    terms: ["Kitchen"],
    values: [ ["white"]]
  },
  {
    terms: ["Living Room"],
    values: [ ["yellow"] ]
  },
  {
    terms: ["Bedroom"],
    values: [ ["green"], ["orange"], ["red"], ["white"] ]
  }
]

This is the aggregation of the results of f, applied 3 times (one for each document) :

f(doc1) =  [
  {terms: ["Kitchen"]}, values: ["white"]},
  {terms: ["Living Room"]}, values: ["yellow"]},
  {terms: ["Bedroom"]}, values: ["red"]},
]


f(doc2) =  [
  {terms: ["Kitchen"]}, values: ["white"]},
  {terms: ["Living Room"]}, values: ["yellow"]},
  {terms: ["Bedroom"]}, values: ["red"]},
  {terms: ["Bedroom"]}, values: ["green"]},
]


f(doc3) =  [
  {terms: ["Kitchen"]}, values: ["white"]},
  {terms: ["Living Room"]}, values: ["yellow"]},
  {terms: ["Bedroom"]}, values: ["orange"]},
  {terms: ["Bedroom"]}, values: ["green"]},
  {terms: ["Bedroom"]}, values: ["white"]},
]

This is not doable (I think) with the current collection as it is. You can define any function g, it will always be chained with cp to make f. So the result of f (what’s stored by the index) will always be a cartesian product. The results of f I want above are simply not a cartesian product, so it is impossible to define the good g that will make this f when chained with cp.

We might have done two collections : one for houses, one for rooms, and link them with references : that would have solved the problem (because we will apply f not for each house, but for each room, and then this aggregation is doable with cartesian products).

But the app already existed with this first schema, and indexes are already using it for the app in production. Migrating the schema (and the queries) would be possible, but not easy (especially if you do not want down time), but I think it would be better to first be able to make such an index without migrating the hole schema : implement first, optimize later if necessary.

My idea for making this happen, is to add a new way of defining an index. It would not change how current index work, but be an alternative (could be exclusive, just the standard way, or just the new way, not both).

It would work by letting the user define f directly. This results in f not necessarily being a cartesian product anymore. I can design the f I wanted for this question quite easily :

    Lambda( 
      "house", 
      Map(
        Select(["data", "rooms"], Var("house")), 
        Lambda(
          "room",
          {
            terms: [Select("name", Var("room"))],
            values: [Select("wall_color", Var("room"))]
          }
        )
      )
    )

with it, I have indeed :

  f(doc1) = [
    {
      terms: ["Kitchen"],
      values: ["white"]
    },
    {
      terms: ["Living Room"],
      values: ["yellow"]
    },
    {
      terms: ["Bedroom"],
      values: ["red"]
    }
  ]

f(doc2) = [
    {
      terms: ["Kitchen"],
      values: ["white"]
    },
    {
      terms: ["Living Room"],
      values: ["yellow"]
    },
    {
      terms: ["Bedroom"],
      values: ["red"]
    },
    {
      terms: ["Bedroom"],
      values: ["green"]
    }
  ]
 f(doc3) = [
    {
      terms: ["Kitchen"],
      values: ["white"]
    },
    {
      terms: ["Living Room"],
      values: ["yellow"]
    },
    {
      terms: ["Bedroom"],
      values: ["orange"]
    },
    {
      terms: ["Bedroom"],
      values: ["green"]
    },
    {
      terms: ["Bedroom"],
      values: ["white"]
    }
  ]

which is exactly what I want.

Shell command to get this result
Let({
  houses: Select("data", Map(
    Paginate(Match(Index("all_houses"))),
    Lambda("X", Get(Var("X")))
  ))}, 
  
  Map(
    Var("houses"), 
    
    Lambda( 
      "house", 
      Map(
        Select(["data", "rooms"], Var("house")), 
        Lambda(
          "room",
          {
            terms: [Select("name", Var("room"))],
            values: [Select("wall_color", Var("room"))]
          }
        )
      )
    )
    
    
  )

)

How to define this new way

The CreateIndex function could be extanded to accept a new syntax :

CreateIndex({
  name: "rooms_colors_by_names",
  source: Collection("houses"),
  terms_number: 1,
  values_number: 1,
  iterator_callback: Query(
    Lambda( 
      "house", 
       Map(
         Select(["data", "rooms"], Var("house")), 
         Lambda(
         "room",
         {
           terms: [Select("name", Var("room"))],
           values: [Select("wall_color", Var("room"))]
         }))
))})

Terms and values numbers are specified too, because unlike the original system, we cannot determine with certainty how many terms and values the lambda will return per reccord (here I call reccord one of the several (terms, values) pair returned by f), not even know if each record will have the same number of values and terms. I think this is needed at Index creation, so it is specified here. It could result in serveral implentation strategies :

  1. Records with uncorrect number of terms or values are ignored/discarded (not inserted in the index)
  2. A record with uncorrect number of terms or values raises an error : the document creation/update is rejected and the error is handled by the application making the query
  3. A record with uncorrect number of terms or values is extended or trimmed : missing terms or values are completed with null, extra terms or values are dropped.

Which strategy is the best, I do not know. Could be just one, or let the user chose it at Index creation time. Personnaly I think I would design my f correctly, so it always returns the right number of terms, values, so I never use any strategy (and so I would prefer the strategy 2, which makes it obvious that I failed at writing my f correctly).

I hope my idea is clearer now :slight_smile:
It is inspired by my couchdb memories (which defines another “reduce” function that could be really nice also, but this is another topic).

Give me the (sorted) list of cities that have houses with at least a room of color X

"values": [{"field": ["data", "city"]}],
"terms": [{"field": ["data", "rooms", "wall_color"]}

That syntax does not work. The wall_color field is a not a member of the rooms array.

An index definition is a function f(document) → list< (terms, values) >

If that’s how you want to think of how indexes work, just remember that f is not user-definable.

but there is a cartesian product cp( ( list< terms >, list< values > ) → list< ( terms, values ) >

This is true only when you attempt to index one or more array fields. Indexing a scalar field does not produce a Cartesian product of index entries.

For this example that you provided:

{
    terms: ["Bedroom"],
    values: [ ["green"], ["orange"], ["red"], ["white"] ]
}

That will not work. The terms and values definition must specify either a document field or a binding, and not a value to be indexed. This is because of Fauna’s temporality feature. Every document has versions, and the version to index is dependent on the transaction timestamp. Normally, the transaction timestamp is the current time (e.g. the result from Now()), but it can be specified with At().

This is not doable (I think) with the current collection as it is.

Correct. At this point, you need to split the documents in the single collection into related documents in multiple collections. Then it is much easier to produce the union or intersection of the document traits that you’re interested in.

Hello Ewan,
Thank you.
I will try to answer you on the different points you made :

Point 1

I thought it would not work for that particular reason, but believe me it does. I think it works more or less like the SelectAll function. As I said, those examples are tested in the github repo I linked above (but thats not really the point here :slight_smile: ).

image

Point 2

Yes, I agree, thats exactly my point :slight_smile: : now you have to define g. And my proposition is to allow defining f directly (as an alternative to defining g, which could still be the usual way of defining indexes).

I’m sure I did not catch every aspect of how indexes work in this post, but I tried to come up with a notation that explains what the problem is, and what a different approach could be.

Point 3

Yes I agree one more time. But I just think of a scalar value as a singleton list, containing just this scalar value. The result is the same as you’d expect in your sentence. That’s why I still call it a cartesian product.

Point 4

Here is the main point. Here I was not trying to give a description of how to define the index, but to notate what the index stores. This notation exists nowhere in Fauna, I tried to make this up, but I should have explained it more.
Here with this notation I was trying to say that the expected index, should have this behaviour when queried :

Paginate(Match(Index("index_name"), ["Bedroom"]))

should return

{
  data: [
    "green",
    "orange",
    "red",
    "white"
  ]
}

Another example, this notation :

{
    terms: ["X", "Y"],
    values: [ ["A1", "A2"], ["B1", "B2"] ]
}

would mean that

Paginate(Match(Index("index_name"), ["X", "Y"]))

should return

{
  data: [
    ["A1", "A2"],
    ["B1", "B2"],
  ]
}

Some more explaination

I used this notation to explain what I would want the index to store. As you said, this is not possible with the current way of defining indexes. My suggestion is to add another way of defining the index, that makes it possible to have those values stored, while still operating on a single document, with no side effects, just like the indexes require : this is just by giving the CreateIndex function the function f (instead of g), and by telling him how many terms and values it is expected to return.

I suggest you read my previous message with all this new explainations in mind, that should make more sense now :slight_smile: This feature, if accepted, would open a hole new range of possibilities for index creation, without (I believe) changing anything about how indexes are stored or queried, only by adding a new way of defining (and writing) them.

My suggestion is to add another way of defining the index, that makes it possible to have those values stored, while still operating on a single document, with no side effects, just like the indexes require.

The “problem” of not storing sub-documents in indexes is not a syntax problem. Fauna index entries store only scalar values, not objects. Making such a change to the way index entries are stored could have many unintended consequences for existing indexes.

Providing a function that defines the structure of an index entry would, indeed, be a powerful capability. Unfortunately, you would also be required to implement a comparator function, otherwise there would not be a reasonable way to sort such an index. There might not be a common answer to the question “should this document come before that document?”, especially when there are no guarantees that two documents shared any common structure.

And you would likely need to provide a uniqueness function, since you would likely want to define how much of a sub-document’s structure needed to be evaluated to determine the “value” used to determine uniqueness. Open-ended object evaluation would certainly perform poorly. Fauna’s implementation of objects is fairly stable in terms of field order, but that order can change when fields are added or removed. Any comparison of objects for sorting/uniqueness that depends on the “declared” field order would, at some point, produce unexpected results due to document modifications.

Fauna also tries to be as performant as possible. Allowing UDFs to define the most important operations in an index could be a source of any number of performance issues. I’m not saying Fauna would never make these changes, but the impacts of your suggestions are not obvious and would require significant analysis.

In the meantime, solving your immediate problem should be fairly straightforward with related documents in separate collections.

I agree with the problems of storing non-scalar values, but in my suggestion I did not speak about storing documents. The values I want stored are scalar values, in the exact same form than today, just defined with a different (more powerful) method which do not force you to have cartesian products.

(but I agree that my UDF could return non-scalar values. The non scalar values should be ignored, or raise errors (so the “strategies” I define above should be extended. I only considered the case where they are to many or too few terms (or values) returned. It should also handle the cases where they are non-scalar values as well). (but doing so could result in more reads than I would want).

I know I can do this particular use case by changing the schema. I’m not worried about not being able to do it, i’m more excited about what my suggestion could bring to the product.