Combining Queries with Union and partial matching Filter

Hi there,

I’ve got a couple of queries that I’d like to combine, but I’m confusing myself.

The below snippets are using the faunadb client syntax.

The following will return documents with a partial match on title.

“all_items” is an index that’s been setup on a collection to return all items.

q.Map(
    q.Filter(
        q.Paginate(q.Match(q.Index("all_items"))),
        q.Lambda((ref) =>
            q.ContainsStr(
                q.LowerCase(
                    q.Select(["data", "title"], q.Get(ref))
                ),
                searchTerm  // <= this is a variable passed as a string
            )
        )
    ),
    q.Lambda((ref) => q.Get(ref))
)

The below will return exact matches against two indexes:

q.Map(
    q.Paginate(
        q.Union(
            q.Match(q.Index("searchItemsByTag"), searchTerm),
            q.Match(q.Index("searchItemsByTitle"), searchTerm)
        )
    ),
    q.Lambda((ref) => q.Get(ref))
)

I’d like to use the above partial matching pattern within a Union so that the returned items are either a partial match of the search term on title or an exact match on the searchItemsByTag index, which is an array (actually, I’d like to do partial matching on the tags too, but one thing at a time):

q.Map(
    q.Paginate(
        q.Union(
            q.Match(q.Index("searchItemsByTag"), searchTerm),
           // Use above partial matching query here.  But how?!
        )
    ),
    q.Lambda((ref) => q.Get(ref))
)

I’m getting hung up on where to put Filter and so far, after many attempts, I’ve not been able to figure this out.

I’m hopeful that there’s a solution out there.

Thanks in advance for any helpful advice.

Hi @Kimfucious and welcome! :wave:

Paginate then Filter

The most simple thing would probably be to fetch all of the items by tag, and then filter them. This will fetch SIZE documents and then filter and return those that match. The Filter will only be run on at most SIZE documents, so you can easily predict performance and cost.

q.Filter(
    q.Map(
        q.Paginate(q.Match(q.Index("searchItemsByTag"), TAG), { size: SIZE }),
        ref => Get(ref)
    ),
    doc => q.ContainsStr(
        // I recommend Casefold on both the stored data and your search term
        q.Casefold(q.Select(["data", "title"], doc)),
        q.Casefold(PARTIAL_TITLE)
    )
)

Filter then Paginate

You can put the Filter inside of Paginate, but in this case, Filter will continue to run against as many documents in the Set as necessary to fulfill SIZE, potentially the entire Set. You get more predictable number of results, but less predictable performance and cost. In the worst case, a query with just a couple of results could cost as many read ops as the total documents in the collection. But this may be fine in the case that the Set you are filtering is small (in this case, the Set is q.Match(q.Index("searchItemsByTag"), TAG)).

q.Map(
   q.Paginate(
       q.Filter(
           q.Match(q.Index("searchItemsByTag"), TAG),
           ref => q.ContainsStr(
               q.Casefold(q.Select(["data", "title"], Get(ref))),
               q.Casefold(PARTIAL_TITLE)
           )
       ),
       { size: SIZE }
   ),
   ref => Get(ref)
)

Use Index bindings

Here’s a previous topic that I hope can help: Search for substring ( need performant approach)

1 Like

Thanks, @ptpaterson, for that detailed and thoughtful reply.

It seems that my initial approach touches upon the issue of potentially excessive operations, in that performing a search across a collection with a lot documents results in a “scan” of the entire collection.

And it seems that this would be the case even without using union to combine searches.

For example, if I only wanted to do a partial search on title, it follows that every document would need to be read to see whether or not there is a title match.

The thing is that this is pretty much what I want: to search the entire collection.

Let me setup a scenario:

The collection is Recipes with the following fields: id, title, cuisines, tags.
id and title are strings, while cuisines and tags are arrays of strings.

I can’t imagine there ever being more that a few thousand documents in Recipes.

As a user of the app, I’d like to enter a search term and find all recipes that contain the search term in the title or the cuisines array or the tags array. No fuzzy logic, just partial matching.

So if I enter “pot”, for example, I could get a recipes that match the title, like “Potato Galette” or “Pot Brownies” or ones that match the tags like “Instant Pot.”

That’s the goal, anyhow.

So, if I understand correctly, a single (non Union) search on an index of “searchRecipesByTitle”, when the collection has 5000 documents, would result in 5000 reads?

Further, if I did a Union that also included searching searchRecipesByCuisine and searchRecipesByTag, that would result in 15,000 reads?

Or do I have that all wrong?

As a user, I don’t want to enter more that one search term, so reducing by tag first doesn’t really work in this scenario, I think.

Further, if I did a Union that also included searching searchRecipesByCuisine and searchRecipesByTag, that would result in 15,000 reads?

A Union of three sets with 5000 entries each would indeed be a new set with 15,000 entries.

But the expectation of using an Index is that it only returns what matches. If there are zero matches, for example, then you pay the cost of reading the index, but have nothing to iterate over so that’s the only cost. Best case, reading a Union of three sets where none have any entries costs 3 read ops, regardless of how many documents are in the indexes’ source collections.

If you cannot use an index, you can stick to just scanning the collection once, and combine the predicates in Filter.

q.Filter(
    q.Map(
        q.Paginate(q.Documents(q.Collection("Recipes")), { size: SIZE }),
        ref => Get(ref)
    ),
    doc => q.Or(
        q.ContainsStr(
            q.Casefold(q.Select(["data", "tag"], doc)),
            q.Casefold(PARTIAL_TAG)
        ),
        q.ContainsStr(
            q.Casefold(q.Select(["data", "title"], doc)),
            q.Casefold(PARTIAL_TITLE)
        ),
        q.ContainsStr(
            q.Casefold(q.Select(["data", "cuisine"], doc)),
            q.Casefold(PARTIAL_CUISINE)
        ),
    )
)

Regarding indexes with bindings, consider if whole word search is okay. Or perhaps your app can try a more restrictive Index, and fall back on the full scan.

  • Try exact match (indexes, no bindings)
  • if not enough matches, try whole word search (indexes with bindings)
  • if not enough matches, start paginating through the whole collection and filtering

You might also optimize the filtering by using an Index that provides the search fields as values. This will avoid the cost of reading each document as you filter through them.

CreateIndex({
  name: "all_recipes__tag_asc_title_asc_cuisine_asc",
  source: Collection("Recipes"),
  values: [
    { field: ["data", "tag"] },
    { field: ["data", "title"] },
    { field: ["data", "cuisine"] },
    { field: "ref" }
  ]
})
q.Map(
    q.Filter(
        q.Paginate(q.Match(q.Index("all_recipes__tag_asc_title_asc_cuisine_asc")), { size: SIZE }),
        (tag, title, cuisine, ref) => q.Or(
            q.ContainsStr(
                q.Casefold(tag),
                q.Casefold(PARTIAL_TAG)
            ),
            q.ContainsStr(
                q.Casefold(title),
                q.Casefold(PARTIAL_TITLE)
            ),
            q.ContainsStr(
                q.Casefold(cuisine),
                q.Casefold(PARTIAL_CUISINE)
            ),
       )
    ),
    (tag, title, cuisine, ref) => q.Get(ref)
)

Note in the last example, the Get call is after Filter.

As a user, I don’t want to enter more that one search term, so reducing by tag first doesn’t really work in this scenario, I think.

Right! The example I gave more for something like tag = "pot" AND title LIKE "something else". The examples above should do a better job of something like

tag LIKE "pot" OR title LIKE "pot" OR cuisine LIKE "pot"
1 Like

Thanks again for another excellent reply, @ptpaterson. Your taking the time to respond with such detail is much appreciated.

Using your first example, I got the below working, querying a collection index, all_recipes, rather than a collection.

As cuisines and tags are arrays, I’ve used ContainsValue, which didn’t like Casefold, so I just used str.toLowerCase() on str to get lc.

q.Filter(
    q.Map(
        q.Paginate(q.Match(q.Index("all_recipes")), { size: SIZE }),
        ref => q.Get(ref)
    ),
    doc => q.Or(
        q.ContainsStr(
            q.Casefold(q.Select(["data", "title"], doc)),
            q.Casefold(str)
        ),
        q.ContainsValue(
            lc,
            q.Select(["data", "tags"], doc),
        ),
        q.ContainsValue(
            lc,
            q.Select(["data", "cuisines"], doc),
        ),
    )
)

While the above works, I’m still a bit thick as to whether or not using a collection index is any more performant/economical than querying the collection. My guess is that it’s not, but I’d like your thought on that, if you’d be so kind.

That said, I like the second example you gave, with a new index that uses the search fields (and the ref) as values.

What I see this does is create a document for the various combinations of values based on existing data in the collection.

What I can’t see is if/when the index gets rebuilt when new data is added or existing data is updated. The docs seem to indicate that indexes do not get rebuilt. This means that I’d need to delete and re-create the index on any new cuisine, tag, or recipe and any related mutations (e.g. adding/removing a new tag to/from a recipe). This seems infeasible, .esp when the app is built around the idea of creating recipes. Any explanation or examples of how people deal with this would be much appreciated.

q.Map(
    q.Filter(
        q.Paginate(
            ( q.Match(q.Index("searchRecipesByCuisineTagTitle")) ),
            { size: SIZE }
        ),
        (cuisine, tag, title, ref) => {
            console.log(
                "cuisine, tag, title",
                cuisine,
                tag,
                title
            );
            return q.Or(
                q.ContainsStr(
                    q.Casefold(title),
                    q.Casefold(str)
                ),
                q.ContainsStr(q.Casefold(tag), q.Casefold(str)),
                q.ContainsStr(
                    q.Casefold(cuisine),
                    q.Casefold(str)
                )
            );
        }
    ),
    (tag, title, cuisine, ref) => q.Get(ref)
)

Finally, ignoring feasibility, while the above works and seems “snappier”, it does result in dupes. For example if title contains “pot” and tag contains “pot”.

I have tried placing Distinct in various spots in the query, but I can’t seem the right spot.

Indexes are updated transactionally along with their source collection(s). If a document is successfully written to, any affected index is guaranteed to be written.

Please let us know if there is something particular in the docs that gives the wrong impression, so we can review and update as necessary.

Collections are only accessible through an Index. When you use Documents(Collection("some_collection")) you are using a built-in Index. There is no difference between reading the built-in index, and reading your own “all_” index. Note that creating your own uses storage and adds an additional index that must be written to every time a document is created or deleted.

Oh, ok. That does complicate things. If you set tags and cuisines to be the index’s values, then you will get an entry for each item in the array. In fact, if you include both values, you will get a cartesian product of both arrays. This can get quickly out of hand, so be mindful that writing to this index can take more time and cost more if the arrays contain many values.

Consider the following example document:

{
  ref: Ref(Collection("things"), "101"),
  ts: 163405928345123,
  data: {
    a: [0, 1, 2, 3, 4],
    b: [5, 6, 7, 8, 9]
  }
}

With an index like

CreateIndex({
  name: "array_values_example",
  source: Collection("things"),
  values: [
    { field: ["data", "a"] },
    { field: ["data", "b"] },
    { field: "ref" }
  ]
})

Each value field is an array with 5 items, so there will be 5 * 5 = 25 entries for that one document!

Paginate(Match(Index("array_values_example")))

// output
{
  "data": [
    [0, 5, Ref(Collection("things"), "101")],
    [0, 6, Ref(Collection("things"), "101")],
    [0, 7, Ref(Collection("things"), "101")],
    [0, 8, Ref(Collection("things"), "101")],
    [0, 9, Ref(Collection("things"), "101")],
    [1, 5, Ref(Collection("things"), "101")],
    [1, 6, Ref(Collection("things"), "101")],
    [1, 7, Ref(Collection("things"), "101")],
    [1, 8, Ref(Collection("things"), "101")],
    [1, 9, Ref(Collection("things"), "101")],
    [2, 5, Ref(Collection("things"), "101")],
    [2, 6, Ref(Collection("things"), "101")],
    [2, 7, Ref(Collection("things"), "101")],
    [2, 8, Ref(Collection("things"), "101")],
    [2, 9, Ref(Collection("things"), "101")],
    [3, 5, Ref(Collection("things"), "101")],
    [3, 6, Ref(Collection("things"), "101")],
    [3, 7, Ref(Collection("things"), "101")],
    [3, 8, Ref(Collection("things"), "101")],
    [3, 9, Ref(Collection("things"), "101")],
    [4, 5, Ref(Collection("things"), "101")],
    [4, 6, Ref(Collection("things"), "101")],
    [4, 7, Ref(Collection("things"), "101")],
    [4, 8, Ref(Collection("things"), "101")],
    [4, 9, Ref(Collection("things"), "101")],
  ]
}

In order to get the one, distinct Ref from the results, you need to get an array or a set that is just the Ref, otherwise the other values make each entry distinct, even through the Ref is the same for each one.

q.Map(
  q.Distinct(
        // same query as before, but don't Get the document, yet
        q.Map(
            q.Filter(
                q.Paginate(q.Match(q.Index("all_recipes__tag_asc_title_asc_cuisine_asc")), { size: SIZE }),
                (tag, title, cuisine, ref) => q.Or(
                    q.ContainsStr(
                        q.Casefold(tag),
                        q.Casefold(PARTIAL_TAG)
                    ),
                    q.ContainsStr(
                        q.Casefold(title),
                        q.Casefold(PARTIAL_TITLE)
                    ),
                    q.ContainsStr(
                        q.Casefold(cuisine),
                        q.Casefold(PARTIAL_CUISINE)
                    ),
               )
            ),
            (tag, title, cuisine, ref) => ref // don't Get yet
        )
    ),
    ref => Get(ref)
)

side note

You cannot console.log in FQL. Using the Javascript arrow functions is a convenience for using the Lambda FQL function. Fauna does not actually execute javascript code.

Hi @ptpaterson,

Thanks for another awesome reply.

I went to try some things out and my Fauna db is gone!

It seems like you folks are having a major system event, so I’ll try soonest and report when I get to it.

Best,

Kim

Regarding the documentation on indexes mentioned above, theres a section titled Modifying Index Documents here.

Being new to Fauna, and reading way too many documents, I totally misread that. Now I understand that while index properties can’t be modified, except for the name, the index does get updated as the data in the collection is modified.

On the occasions where the values do need to get updated, deleting and re-creating the index is trivial compared what I was imagining.

Thank you for following up and clarifying!

Since schema names can be modified, you can build a new index with a temporary name, and then swap the names so your application can start using the new one sooner. Due to our caching mechanism, you still need to wait about a minute to reuse a name. But consider if you have gigabytes of data, or complex bindings to evaluate, it could take hours to build a new index.

Hi @ptpaterson,

It’s taking me a while to get work to this to finality, as I’m trying to figure out how I can add sorting and another bugbear (trying to find a match where any of an array of search terms is contained in and index value field. The docs have me looking back at Union, but…), so stay tuned.

Or, if you prefer, we can close this, and I can back to you on a new thread.

If you are still trying to filter by partial tag & title etc., then I think here would be a good place since the partial string matches are what continue to make this so complicated.

We have a cookbook example for matching an index with a list of terms: List of terms :: Fauna Documentation

If you have questions on the technique in the cookbook that are not related to the previous discussion, a new ticket would be appropriate. But just use your best judgment :slight_smile:

Thanks, @ptpaterson.

This is following the same pattern that we’ve discussed above, so I’m keeping it here for now.

In brief, I’ve created the following index:

{
  name: "recipes_by_chef_id",
  unique: false,
  serialized: true,
  source: "Recipe",
  values: [
    {
      field: ["data", "chefId"]
    },
    {
      field: ["ref"]
    }
  ]
}

This works great for the following query (and I probably don’t need Distinct here):

Get Recipes Query

q.Map(
    q.Distinct(
        q.Map(
            q.Filter(
                q.Paginate(
                    q.Match(q.Index("recipes_by_chef_id")),
                    options
                ),
                (chefId, ref) =>
                    q.Or(
                        q.Equals(chefId, myChefId),
                        q.ContainsValue(chefId, myChefNetwork)
                    )
            ),
            (chefId, ref) => ref
        )
    ),
    (ref) => q.Get(ref)
)

Taking this a step further, I’ve created an index specifically for search:

{
  name: "search_recipes",
  unique: false,
  serialized: true,
  source: "Recipe",
  values: [
    {
      field: ["data", "title"]
    },
    {
      field: ["data", "chefId"]
    },
    {
      field: ["data", "cuisines"]
    },
    {
      field: ["data", "tags"]
    },
    {
      field: ["ref"]
    }
  ]
}

Amazingly my head did not explode with the Ands and Ors to get this working:

Search Recipes Query

q.Map(
    q.Distinct(
        q.Map(
            q.Filter(
                q.Paginate(
                    q.Match(q.Index("search_recipes")),
                    options // <- {size: 5, after}
                ),
                (title, cuisines, tags, chefId, ref) =>
                    q.Or(
                        q.And(
                            q.Or(
                                q.Equals(chefId, myChefId),
                                q.ContainsValue(
                                    chefId,
                                    myChefNetwork
                                )
                            ),
                            q.Or(
                                q.ContainsStr(
                                    q.Casefold(title),
                                    q.Casefold(searchString)
                                )
                            )
                        ),
                        q.And(
                            q.Or(
                                q.Equals(chefId, myChefId),
                                q.ContainsValue(
                                    chefId,
                                    myChefNetwork
                                )
                            ),
                            q.Or(
                                q.ContainsStr(
                                    q.Casefold(cuisines),
                                    q.Casefold(searchString)
                                )
                            )
                        ),
                        q.And(
                            q.Or(
                                q.Equals(chefId, myChefId),
                                q.ContainsValue(
                                    chefId,
                                    myChefNetwork
                                )
                            ),
                            q.Or(
                                q.ContainsStr(
                                    q.Casefold(tags),
                                    q.Casefold(searchString)
                                )
                            )
                        )
                    )
            ),
            (title, cuisines, tags, chefId, ref) => ref // <- order matters here!
        )
    ),
    (ref) => q.Get(ref)
)

As I understand from the docs, the results will be sorted based on the values provided in the index, so the data should be sorted title ascending plus the rest of the values in order of their place in the index.

Now here’s where I’m stuck. If you notice, I’m passing an options object, which has the page size and the after cursor that’s been converted to an expression using new Expr(). This works as expected for the Get Recipes query. Things paginate, and I can load next/previous with the after/before cursors.

However, when I run the Search Recipes query, the results are less than expected.

I know this is related to page size, as I can hardcode that to 100 and see the desired 10 results using specific search criteria; however, if I leave this at 5, for example, I get 2 results for the same search criteria, not 5.

I have scoured the docs and the mentioned cookbooks, but couldn’t find anything that helped with this specific scenario, but I am still a noob with Fauna so… I’m back here.

Any advice would be greatly appreciated.

I think I understand what’s going on: 2 of the items of the page (with a size of 5) match the search criteria. The other 3 don’t because of how the index is built with values.

This is tricky…