Hello (again) Ewan, and thank you.
I think I did not explained my suggestion right. I understood the way indexes work (or at least the big picture as you explained it to me). I still think my idea or something similar could solve some issues.
I created a really small dataset, and a python script to reproduce it here : GitHub - MaximeWeyl/fauna_example
It contains houses (3 of them right now) :
{
"ref": Ref(Collection("houses"), "333844715967348800"),
"ts": 1654637981290000,
"data": {
"street": "Time Square",
"city": "New York",
"rooms": [
{
"name": "Kitchen",
"area": 20,
"wall_color": "white"
},
{
"name": "Living Room",
"area": 40,
"wall_color": "yellow"
},
{
"name": "Bedroom",
"area": 30,
"wall_color": "red"
}
]
}
}
"ref": Ref(Collection("houses"), "333844716155043904"),
"ts": 1654637981470000,
"data": {
"street": "Rue de la paix",
"city": "Paris",
"rooms": [
{
"name": "Kitchen",
"area": 10,
"wall_color": "white"
},
{
"name": "Living Room",
"area": 20,
"wall_color": "yellow"
},
{
"name": "Bedroom",
"area": 10,
"wall_color": "red"
},
{
"name": "Bedroom",
"area": 10,
"wall_color": "green"
}
]
}
}
{
"ref": Ref(Collection("houses"), "333844716312330304"),
"ts": 1654637981620000,
"data": {
"street": "Alexanderplatz",
"city": "Berlin",
"rooms": [
{
"name": "Kitchen",
"area": 30,
"wall_color": "white"
},
{
"name": "Living Room",
"area": 40,
"wall_color": "yellow"
},
{
"name": "Bedroom",
"area": 10,
"wall_color": "orange"
},
{
"name": "Bedroom",
"area": 10,
"wall_color": "green"
},
{
"name": "Bedroom",
"area": 12,
"wall_color": "white"
}
]
}
}
Initial app
Let say this model was implemented some time ago, and the app made the clients happy.
They are able to answer questions like (indexes exist in my github repo to create the example):
Show me the houses in city X
"terms": [{"field": ["data", "city"]}]
Give me the (sorted) list of cities where houses are located
"values": [{"field": ["data", "city"]}]
Give me the (sorted) list of cities that have houses with at least a room of color X
"values": [{"field": ["data", "city"]}],
"terms": [{"field": ["data", "rooms", "wall_color"]}
Give me the (sorted) list of cities that have houses with at least a room of color X
"values": [{"field": ["data", "city"]}],
"terms": [{"field": ["data", "rooms", "wall_color"]}
How an index is defined
An index definition is a function f(document) → list< (terms, values) >. The current definition let the user define a g(document) → ( list< terms > , list< values > ) but there is a cartesian product cp( ( list< terms >, list< values > ) → list< ( terms, values ) > that is chained in the background, and really what’s stored is a list< (terms, values) > (or something similar). You define g, knowing that cp will be chained to get result f.
As you said, this f function is applied when a document is created or updated, not at Match time. This is why it takes only the document as input and must be pure (no other reads, no writes, no side effects).
Using bindings or not does not change the signature of g, this is just a way of writing it that allows to define g that does not just access properties by name. Non bindings definitions (“fields”) are just some shortcut for a binding that would be a simple “SelectAll” function applied.
New enhancement wanted from customers
This is all great, the app is in production. The customers use it more and more, and they begin to have new questions. Most of them we can just build the appropriate index, and answer them, by defining the right function g, that will be chained with cp, to make the f function.
But for some questions, I cannot define the right g function.
For instance :
Give me the list of colors that are used for rooms of type X (for instance X=Kitchen)
For this, you just want an index that would store the following (this is some notation I made up for showing what the index stores, you should understand it quite easily) :
[
{
terms: ["Kitchen"],
values: [ ["white"]]
},
{
terms: ["Living Room"],
values: [ ["yellow"] ]
},
{
terms: ["Bedroom"],
values: [ ["green"], ["orange"], ["red"], ["white"] ]
}
]
This is the aggregation of the results of f, applied 3 times (one for each document) :
f(doc1) = [
{terms: ["Kitchen"]}, values: ["white"]},
{terms: ["Living Room"]}, values: ["yellow"]},
{terms: ["Bedroom"]}, values: ["red"]},
]
f(doc2) = [
{terms: ["Kitchen"]}, values: ["white"]},
{terms: ["Living Room"]}, values: ["yellow"]},
{terms: ["Bedroom"]}, values: ["red"]},
{terms: ["Bedroom"]}, values: ["green"]},
]
f(doc3) = [
{terms: ["Kitchen"]}, values: ["white"]},
{terms: ["Living Room"]}, values: ["yellow"]},
{terms: ["Bedroom"]}, values: ["orange"]},
{terms: ["Bedroom"]}, values: ["green"]},
{terms: ["Bedroom"]}, values: ["white"]},
]
This is not doable (I think) with the current collection as it is. You can define any function g, it will always be chained with cp to make f. So the result of f (what’s stored by the index) will always be a cartesian product. The results of f I want above are simply not a cartesian product, so it is impossible to define the good g that will make this f when chained with cp.
We might have done two collections : one for houses, one for rooms, and link them with references : that would have solved the problem (because we will apply f not for each house, but for each room, and then this aggregation is doable with cartesian products).
But the app already existed with this first schema, and indexes are already using it for the app in production. Migrating the schema (and the queries) would be possible, but not easy (especially if you do not want down time), but I think it would be better to first be able to make such an index without migrating the hole schema : implement first, optimize later if necessary.
My idea for making this happen, is to add a new way of defining an index. It would not change how current index work, but be an alternative (could be exclusive, just the standard way, or just the new way, not both).
It would work by letting the user define f directly. This results in f not necessarily being a cartesian product anymore. I can design the f I wanted for this question quite easily :
Lambda(
"house",
Map(
Select(["data", "rooms"], Var("house")),
Lambda(
"room",
{
terms: [Select("name", Var("room"))],
values: [Select("wall_color", Var("room"))]
}
)
)
)
with it, I have indeed :
f(doc1) = [
{
terms: ["Kitchen"],
values: ["white"]
},
{
terms: ["Living Room"],
values: ["yellow"]
},
{
terms: ["Bedroom"],
values: ["red"]
}
]
f(doc2) = [
{
terms: ["Kitchen"],
values: ["white"]
},
{
terms: ["Living Room"],
values: ["yellow"]
},
{
terms: ["Bedroom"],
values: ["red"]
},
{
terms: ["Bedroom"],
values: ["green"]
}
]
f(doc3) = [
{
terms: ["Kitchen"],
values: ["white"]
},
{
terms: ["Living Room"],
values: ["yellow"]
},
{
terms: ["Bedroom"],
values: ["orange"]
},
{
terms: ["Bedroom"],
values: ["green"]
},
{
terms: ["Bedroom"],
values: ["white"]
}
]
which is exactly what I want.
Shell command to get this result
Let({
houses: Select("data", Map(
Paginate(Match(Index("all_houses"))),
Lambda("X", Get(Var("X")))
))},
Map(
Var("houses"),
Lambda(
"house",
Map(
Select(["data", "rooms"], Var("house")),
Lambda(
"room",
{
terms: [Select("name", Var("room"))],
values: [Select("wall_color", Var("room"))]
}
)
)
)
)
)
How to define this new way
The CreateIndex function could be extanded to accept a new syntax :
CreateIndex({
name: "rooms_colors_by_names",
source: Collection("houses"),
terms_number: 1,
values_number: 1,
iterator_callback: Query(
Lambda(
"house",
Map(
Select(["data", "rooms"], Var("house")),
Lambda(
"room",
{
terms: [Select("name", Var("room"))],
values: [Select("wall_color", Var("room"))]
}))
))})
Terms and values numbers are specified too, because unlike the original system, we cannot determine with certainty how many terms and values the lambda will return per reccord (here I call reccord one of the several (terms, values) pair returned by f), not even know if each record will have the same number of values and terms. I think this is needed at Index creation, so it is specified here. It could result in serveral implentation strategies :
- Records with uncorrect number of terms or values are ignored/discarded (not inserted in the index)
- A record with uncorrect number of terms or values raises an error : the document creation/update is rejected and the error is handled by the application making the query
- A record with uncorrect number of terms or values is extended or trimmed : missing terms or values are completed with null, extra terms or values are dropped.
Which strategy is the best, I do not know. Could be just one, or let the user chose it at Index creation time. Personnaly I think I would design my f correctly, so it always returns the right number of terms, values, so I never use any strategy (and so I would prefer the strategy 2, which makes it obvious that I failed at writing my f correctly).
I hope my idea is clearer now
It is inspired by my couchdb memories (which defines another “reduce” function that could be really nice also, but this is another topic).