I’d like to use the above partial matching pattern within a Union, so that the returned items are either a partial match of the search term on title or an exact match on the searchItemsByTag index, whose indexed field is an array (actually, I’d like to do partial matching on the tags too, but one thing at a time):
q.Map(
q.Paginate(
q.Union(
q.Match(q.Index("searchItemsByTag"), searchTerm),
// Use above partial matching query here. But how?!
)
),
q.Lambda((ref) => q.Get(ref))
)
I’m getting hung up on where to put Filter and so far, after many attempts, I’ve not been able to figure this out.
The simplest thing would probably be to fetch all of the items by tag, and then filter them. This will fetch SIZE documents and then filter and return those that match. The Filter will only be run on at most SIZE documents, so you can easily predict performance and cost.
q.Filter(
q.Map(
q.Paginate(q.Match(q.Index("searchItemsByTag"), TAG), { size: SIZE }),
ref => q.Get(ref)
),
doc => q.ContainsStr(
// I recommend Casefold on both the stored data and your search term
q.Casefold(q.Select(["data", "title"], doc)),
q.Casefold(PARTIAL_TITLE)
)
)
Filter then Paginate
You can put the Filter inside of Paginate, but in this case, Filter will continue to run against as many documents in the Set as necessary to fulfill SIZE, potentially the entire Set. You get a more predictable number of results, but less predictable performance and cost. In the worst case, a query with just a couple of results could cost as many read ops as there are documents in the collection. But this may be fine in the case that the Set you are filtering is small (in this case, the Set is q.Match(q.Index("searchItemsByTag"), TAG)).
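For concreteness, the inverted nesting looks roughly like this (a sketch only; TAG, SIZE, and PARTIAL_TITLE are placeholders as before):

```javascript
q.Map(
  q.Paginate(
    q.Filter(
      q.Match(q.Index("searchItemsByTag"), TAG),
      (ref) => q.ContainsStr(
        // Each predicate call reads the document, so cost scales with
        // how many entries Paginate has to walk to fill the page.
        q.Casefold(q.Select(["data", "title"], q.Get(ref))),
        q.Casefold(PARTIAL_TITLE)
      )
    ),
    { size: SIZE }
  ),
  (ref) => q.Get(ref)
)
```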
Thanks, @ptpaterson, for that detailed and thoughtful reply.
It seems that my initial approach touches upon the issue of potentially excessive operations, in that performing a search across a collection with a lot of documents results in a “scan” of the entire collection.
And it seems that this would be the case even without using union to combine searches.
For example, if I only wanted to do a partial search on title, it follows that every document would need to be read to see whether or not there is a title match.
The thing is that this is pretty much what I want: to search the entire collection.
Let me setup a scenario:
The collection is Recipes with the following fields: id, title, cuisines, tags. id and title are strings, while cuisines and tags are arrays of strings.
I can’t imagine there ever being more than a few thousand documents in Recipes.
As a user of the app, I’d like to enter a search term and find all recipes that contain the search term in the title or the cuisines array or the tags array. No fuzzy logic, just partial matching.
So if I enter “pot”, for example, I could get recipes that match the title, like “Potato Galette” or “Pot Brownies”, or ones that match the tags, like “Instant Pot.”
That’s the goal, anyhow.
So, if I understand correctly, a single (non Union) search on an index of “searchRecipesByTitle”, when the collection has 5000 documents, would result in 5000 reads?
Further, if I did a Union that also included searching searchRecipesByCuisine and searchRecipesByTag, that would result in 15,000 reads?
Or do I have that all wrong?
As a user, I don’t want to enter more than one search term, so reducing by tag first doesn’t really work in this scenario, I think.
Further, if I did a Union that also included searching searchRecipesByCuisine and searchRecipesByTag, that would result in 15,000 reads?
A Union of three sets with 5000 entries each would indeed be a new set with 15,000 entries.
But the expectation of using an Index is that it only returns what matches. If there are zero matches, for example, then you pay the cost of reading the index, but have nothing to iterate over so that’s the only cost. Best case, reading a Union of three sets where none have any entries costs 3 read ops, regardless of how many documents are in the indexes’ source collections.
If you cannot use an index, you can stick to just scanning the collection once, and combine the predicates in Filter.
Regarding indexes with bindings, consider if whole word search is okay. Or perhaps your app can try a more restrictive Index, and fall back on the full scan.
Try exact match (indexes, no bindings)
if not enough matches, try whole word search (indexes with bindings)
if not enough matches, start paginating through the whole collection and filtering
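Client-side, that escalation might be sketched like this in plain JavaScript, with an in-memory array of titles standing in for the three kinds of index queries (all names here are illustrative, not Fauna APIs):

```javascript
// Simulated three-tier search: exact, then whole-word, then substring.
// In the real app each tier would be a separate, increasingly costly
// Fauna query; here an array stands in for the indexed data.
function tieredSearch(titles, term, minResults) {
  const t = term.toLowerCase();

  // Tier 1: exact match (cheapest; an index with the field as a term)
  let hits = titles.filter((title) => title.toLowerCase() === t);
  if (hits.length >= minResults) return hits;

  // Tier 2: whole-word match (an index with a word-splitting binding)
  hits = titles.filter((title) =>
    title.toLowerCase().split(/\s+/).includes(t)
  );
  if (hits.length >= minResults) return hits;

  // Tier 3: substring match (the full scan + Filter fallback)
  return titles.filter((title) => title.toLowerCase().includes(t));
}
```

The point is that most searches are satisfied by the cheap tiers, and the expensive scan only runs when they come up short.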
You might also optimize the filtering by using an Index that provides the search fields as values. This will avoid the cost of reading each document as you filter through them.
Note in the last example, the Get call is after Filter.
As a user, I don’t want to enter more than one search term, so reducing by tag first doesn’t really work in this scenario, I think.
Right! The example I gave was more for something like tag = "pot" AND title LIKE "something else". The examples above should do a better job of something like
tag LIKE "pot" OR title LIKE "pot" OR cuisine LIKE "pot"
While the above works, I’m still a bit fuzzy on whether or not using a collection index is any more performant/economical than querying the collection. My guess is that it’s not, but I’d like your thoughts on that, if you’d be so kind.
That said, I like the second example you gave, with a new index that uses the search fields (and the ref) as values.
What I see this does is create an index entry for the various combinations of values based on existing data in the collection.
What I can’t see is if/when the index gets rebuilt when new data is added or existing data is updated. The docs seem to indicate that indexes do not get rebuilt. This means that I’d need to delete and re-create the index on any new cuisine, tag, or recipe, and any related mutations (e.g. adding/removing a tag to/from a recipe). This seems infeasible, especially when the app is built around the idea of creating recipes. Any explanation or examples of how people deal with this would be much appreciated.
Finally, ignoring feasibility, while the above works and seems “snappier”, it does result in dupes. For example if title contains “pot” and tag contains “pot”.
I have tried placing Distinct in various spots in the query, but I can’t seem to find the right spot.
Indexes are updated transactionally along with their source collection(s). If a document is successfully written to, any affected index is guaranteed to be written.
Please let us know if there is something particular in the docs that gives the wrong impression, so we can review and update as necessary.
Collections are only accessible through an Index. When you use Documents(Collection("some_collection")) you are using a built-in Index. There is no difference between reading the built-in index, and reading your own “all_” index. Note that creating your own uses storage and adds an additional index that must be written to every time a document is created or deleted.
Oh, ok. That does complicate things. If you set tags and cuisines to be the index’s values, then you will get an entry for each item in the array. In fact, if you include both values, you will get a cartesian product of both arrays. This can quickly get out of hand, so be mindful that writing to this index can take more time and cost more if the arrays contain many values.
In order to get the one, distinct Ref from the results, you need to get an array or a set that is just the Ref, otherwise the other values make each entry distinct, even though the Ref is the same for each one.
q.Map(
q.Distinct(
// same query as before, but don't Get the document, yet
q.Map(
q.Filter(
q.Paginate(q.Match(q.Index("all_recipes__tag_asc_title_asc_cuisine_asc")), { size: SIZE }),
(tag, title, cuisine, ref) => q.Or(
q.ContainsStr(
q.Casefold(tag),
q.Casefold(PARTIAL_TAG)
),
q.ContainsStr(
q.Casefold(title),
q.Casefold(PARTIAL_TITLE)
),
q.ContainsStr(
q.Casefold(cuisine),
q.Casefold(PARTIAL_CUISINE)
),
)
),
(tag, title, cuisine, ref) => ref // don't Get yet
)
),
ref => q.Get(ref)
)
side note
You cannot console.log in FQL. Using the Javascript arrow functions is a convenience for using the Lambda FQL function. Fauna does not actually execute javascript code.
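To illustrate, the driver rewrites an arrow function into the equivalent Lambda expression before sending the query, so these two lines (where page is any set or page of refs) describe the same FQL:

```javascript
q.Map(page, (ref) => q.Get(ref))
// ...is shorthand for...
q.Map(page, q.Lambda("ref", q.Get(q.Var("ref"))))
```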
Regarding the documentation on indexes mentioned above, there’s a section titled Modifying Index Documents here.
Being new to Fauna, and reading way too many documents, I totally misread that. Now I understand that while index properties can’t be modified, except for the name, the index does get updated as the data in the collection is modified.
On the occasions where the values do need to get updated, deleting and re-creating the index is trivial compared what I was imagining.
Since schema names can be modified, you can build a new index with a temporary name, and then swap the names so your application can start using the new one sooner. Due to our caching mechanism, you still need to wait about a minute to reuse a name. But consider if you have gigabytes of data, or complex bindings to evaluate, it could take hours to build a new index.
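A sketch of that swap, assuming the rebuilt index was created under the temporary name search_recipes_v2 (both names are illustrative):

```javascript
// 1. Rename the old index out of the way.
q.Update(q.Index("search_recipes"), { name: "search_recipes_old" })
// 2. After the ~1 minute name-cache window, promote the new index.
q.Update(q.Index("search_recipes_v2"), { name: "search_recipes" })
// 3. Once nothing references it, remove the old index.
q.Delete(q.Index("search_recipes_old"))
```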
It’s taking me a while to get this work to finality, as I’m trying to figure out how I can add sorting, plus another bugbear (trying to find a match where any of an array of search terms is contained in an index value field; the docs have me looking back at Union, but…), so stay tuned.
Or, if you prefer, we can close this, and I can get back to you on a new thread.
If you are still trying to filter by partial tag & title etc., then I think here would be a good place since the partial string matches are what continue to make this so complicated.
If you have questions on the technique in the cookbook that are not related to the previous discussion, a new ticket would be appropriate. But just use your best judgment.
As I understand from the docs, the results will be sorted based on the values provided in the index, so the data should be sorted title ascending plus the rest of the values in order of their place in the index.
Now here’s where I’m stuck. If you notice, I’m passing an options object, which has the page size and the after cursor that’s been converted to an expression using new Expr(). This works as expected for the Get Recipes query. Things paginate, and I can load next/previous with the after/before cursors.
However, when I run the Search Recipes query, the results are less than expected.
I know this is related to page size, as I can hardcode that to 100 and see the desired 10 results using specific search criteria; however, if I leave this at 5, for example, I get 2 results for the same search criteria, not 5.
I have scoured the docs and the mentioned cookbooks, but couldn’t find anything that helped with this specific scenario, but I am still a noob with Fauna so… I’m back here.
I think I understand what’s going on: 2 of the items of the page (with a size of 5) match the search criteria. The other 3 don’t because of how the index is built with values.
I’m looking for a bit of advice here, as I still can’t get pagination to work as I want it to for certain queries. I’m writing this here, as it’s related to the original content of this thread.
In some scenarios, I’m using the resultant data in an “infinite scroll” or a list where there’s a “Show More” button, if there is an “after”, and this works well when size is set to a high value; however, when size is set to a low value, like 10 here, the query results can be less than 10, because some of the docs returned do not match the conditionals (searchChefId, isPublished).
At least, I believe that’s what’s happening. Regardless, this makes the UI a little wonky. For example, 10 is a good size for some pages. I’m actually doing dynamic sizes based on screen height, but that’s beside the point. For now, let’s just say that 10 is the right amount of recipes to fill out the user’s view nicely. On infinite scroll pages, the observed element that triggers the next query is below the fold.
When the query returns only 3 items, however, those three recipes are displayed, and then the infinite scroll is triggered (as the observable is above the fold). On pages where the “Show More” button is used, there are just three recipes shown with the “Show More” button and then a bunch of blank space.
I thought about doing some recursive logic to make the data result consistent (i.e. keep querying while there’s an after, until the size is met), but I thought I’d ask an expert, who knows his stuff, before I do anything stupid.
You are using Filter on the results from Paginate, so there can be 0 or more results that match the criteria.
If you change the nesting to Paginate the results of Filter, then you should get full pages until the end of the set of matching documents is reached. However, calling Filter might exceed the transaction limits if you have too many matching documents.
It would be better to construct an index that met the Filter conditions.
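As a sketch, the change is to move Paginate outside Filter; the index, value names, and searchChefId here are assumptions based on the discussion:

```javascript
// Before: q.Filter(q.Paginate(...), pred) — pages can come back short.
// After: Paginate walks the filtered set until it fills the page.
q.Map(
  q.Paginate(
    q.Filter(
      q.Match(q.Index("search_recipes")),
      (title, chefId, cuisines, tags, ref) =>
        q.Equals(chefId, searchChefId)
    ),
    { size: 10 }
  ),
  (title, chefId, cuisines, tags, ref) => q.Get(ref)
)
```

Note the trade-off described earlier in the thread: Filter may now walk far more of the set in order to fill a page.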
I’m not sure what you mean by exceeding transaction limits. Do you mean that I could get a 429 (i.e. too many requests error), because of multiple queries in succession as it tries to get to 10? Would you be so kind to elaborate?
I’m not sure the above makes sense in this scenario, it seems like I’d need to create an index for each user (is that what you meant?).
That said, I would love to be able to create indexes, for other scenarios, that were “pre-populated” with data matching certain conditions. For example, an “all_published_recipes” index, that populates with any recipe (from the Recipe collection) where the value of data:isPublished is true.
I was scouring the docs for such a thing, and I came up short, so I didn’t know this was possible. I can’t seem to find where you can add conditional logic to an index. Can you please point me somewhere in the docs where I can read up on this?
I’m not sure what you mean by exceeding transaction limits. Do you mean that I could get a 429 (i.e. too many requests error), because of multiple queries in succession as it tries to get to 10? Would you be so kind to elaborate?
If you have millions of index entries, Filter would have to evaluate each of them. Doing so could exceed the available memory or time limit for the transaction, and an HTTP 429 error would occur and you would get no results.
I’m not sure the above makes sense in this scenario, it seems like I’d need to create an index for each user (is that what you meant?).
No.
Your query, as written, appears to perform the equivalent of a table scan to find chefId values that match some searchChefId value, and any index entries where isPublished is true. You could adjust your index definition to include chefId as a term, and let the index perform that matching for you.
As you can see, using the “table scan” strategy involves 13x read ops and 81x compute ops (in 9x the time) to get the same answer as using the index to do the searching for you.
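For example, an index along these lines would let Match do the equality work (field names assumed from the discussion; the isPublished term is optional):

```javascript
{
  name: "recipes_by_chef",  // illustrative name
  source: "Recipe",
  terms: [
    { field: ["data", "chefId"] },
    { field: ["data", "isPublished"] }
  ],
  values: [
    { field: ["data", "title"] },
    { field: ["ref"] }
  ]
}
```

A query would then start from q.Match(q.Index("recipes_by_chef"), searchChefId, true) and only Filter on the remaining, non-equality conditions.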
I read up on the transactions limits you shared, and I can’t foresee exceeding them toward a 429, but I’ll keep this in mind.
Regarding the index, this thread is kinda old and @ptpaterson and I already had a discussion way up top, where he gave me some great insight on performance, which is in line with your kind direction. I should have included that for clarity, mea culpa.
The index for this scenario is below, based on what I’ve learned toward avoiding full scans. All my indexes are built like this in hopes that I’m doing what’s most performant and cost efficient.
{
name: "search_recipes", // <= I know this should have a better name 😊
unique: false,
serialized: true,
source: "Recipe",
values: [
{
field: ["data", "title"]
},
{
field: ["data", "chefId"]
},
{
field: ["data", "cuisines"]
},
{
field: ["data", "tags"]
},
{
field: ["ref"]
}
]
}
Now I’m laughing because I imagine that you’re sharpening the pins for a voodoo doll at this point, but I gotta ask this…
I’ve managed to “fix” all my queries so that I’ve got pagination in the right spot, based on your helpful advice; however, the query that I run on the above index is a bit more tricky:
Your “more tricky” query continues to use the “table scan” strategy, which is necessary when you need to perform more than simple equivalence matching. But you could use an index to handle all of the direct equivalence matching, and then apply Filter for all of the “contains” matches. By doing that, you’re using indexes to reduce the number of index entries to consider. That would make your query more performant and cost less to execute.
The “contains” matches might be better modelled as many-to-many relations by using a “join” collection that identifies which chefs belong to which chef networks, and then you could use indexes to find those matches too. The same would apply to the cuisines.
For your query involving Distinct, it would be better if the Distinct call was inside the Paginate call, since you’d potentially be removing dupes only in a single page, rather than duplicates throughout the resulting set.
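Schematically, the difference is (with someRefSet standing in for the set of bare Refs built earlier):

```javascript
// Dedupe only within each page — dupes can still appear across pages:
q.Distinct(q.Paginate(someRefSet, { size: SIZE }))

// Dedupe the whole set, then page — each Ref appears at most once:
q.Paginate(q.Distinct(someRefSet), { size: SIZE })
```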