Search for substring (need performant approach)

Here is a sample document.

{
  ref: Ref(Collection("usersData"), "298447307533189637"),
  ts: 1620880381996000,
  data: {
    product: "3in1",
    phoneDetails: {
      isoCode: "IN",
      number: "0000",
      isWhatsAppNumber: true
    },
    email: "abc@email.com"
  }
}

I am working on implementing a search bar for my organization’s user dashboard.
If a person searches for ‘ab’ in the email search box, it should list all the users whose email contains that string. For this, I have also created an Index with ‘email’ as a term.
How do I do this in a simple way?

This has come up a few times, but it’s probably good to gather the discussions together!

Forum topics

Stack Overflow

The single most comprehensive notes are probably in this post on Stack Overflow:

NGram in general

The most performant searches come down to using the NGram function. It’s not in the docs; my understanding is that it’s still unstable. Yet if you need to build a performant substring search, this is how to do it.
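To give a feel for it: NGram takes a string (or an array of strings) and emits the n-grams between a minimum and maximum length. Since the function is undocumented, treat this shell sketch as illustrative of observed behavior rather than guaranteed:

// FQL shell. NGram is undocumented; the observed signature is
// NGram(value, min, max). Trigrams of "fauna":
NGram("fauna", 3, 3)
// => something like ["fau", "aun", "una"]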

Examples in the wild

The Fwitter example goes through several steps to show how NGram can be used in different ways.

@eigilsagafos shared a gist (Fauna Ngram search example · GitHub).

That gist is linked from the Awesome Fauna repo (GitHub - fauna-labs/awesome-fauna: Curated list of FaunaDB resources). I am not sure if any examples shared there implement searching.

If anyone else has specific examples to share, and their experience with how the implementations performed, that would be awesome! The Fauna ecosystem appears to be full of “this is a way you could do it”, but not much “I did it this way, and this was my experience.” Or at least, the latter is not as easy to discover.

I do not have direct experience with applying this to larger datasets and am also terribly curious myself, in addition to wanting to make this more discoverable.

Thank you @ptpaterson.
I have gone through these links prior to posting the question here. I am new to Fauna and still struggling to understand FQL and GraphQL. I am using the C# driver, and most of the examples are in JavaScript, so I am facing difficulty in understanding how these things work. I am familiar with MySQL and MSSQL.
Let me tell you my progress regarding the query and your suggestions are always welcome.
I have an index. Fairly simple, just like views in MySQL.
(screenshot of the index definition)

Code: For now I am searching the mobile number, i.e. data.phoneDetails.number. For testing purposes, most of the data is "12345678".
// Paginate fetches a page of refs, then Filter Gets each document to
// test its number field, and the final Map Gets each match again.
Map(
  Filter(
    Paginate(Match(Index("all_users"))),
    Lambda(
      "userRef",
      ContainsStr(
        LowerCase(Select(["data", "phoneDetails", "number"], Get(Var("userRef")))),
        "123"
      )
    )
  ),
  Lambda("userRef", Get(Var("userRef")))
)

After executing: (screenshot of the query result, showing the read ops count)

Here the issue is that the read ops count increases as the number of documents increases (even when the search string is not contained in a document), causing a lot of billing. The same thing for large datasets can be done on an MSSQL server with more predictable billing and in a cost-effective way (if done correctly).
We are currently working with large data that consists of unequal fields, and thus NoSQL becomes a solution in this case. But I can’t see how FaunaDB is able to add value to the project.

There is definitely a learning curve for FQL. I feel your pain trying to pick that up at the same time as GraphQL, and neither Fauna nor GraphQL likely feels very native to C#. Add to that the fact that the recommended way to do text search is to use an undocumented function…

One of the first things you are trying is search, which is probably one of the hardest things to get right. But I get that it’s absolutely critical to prove that you can do it for your application before becoming too invested.

I know you said you went through the links, but please take another look at the Stack Overflow answer. It begins by showing an example with Filter, similar to what you have done, but it goes on to recognize that this is absolutely not feasible for larger data sets. Indeed, it’s too expensive and also unpredictable to Paginate over several pages – for example, you might get a different number of results on each page, since Filter is executed after retrieving each page. And the big thing, which you noticed, is that if you Paginate and then Filter, it costs read ops for each and every result.

The answer to this is writing your own indexes, which use bindings in some way to provide useful search results. For an SQL WHERE clause, the database is going to try to be clever and plan out as efficient a query as possible before defaulting to scanning the entire table. Fauna, on the other hand, does exactly what you tell it to do. So when you told it to Paginate then Filter, that’s what it did: your query instructed Fauna to scan the whole collection.
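To make that concrete, here is the exact-match baseline (shell syntax; the index name here is my own, not something from your database):

// FQL shell. A term index: Match reads only the matching entries,
// not the whole collection. Index name is illustrative.
CreateIndex({
  name: "users_by_phone_number",
  source: Collection("usersData"),
  terms: [{ field: ["data", "phoneDetails", "number"] }]
})

// Exact matches only; read ops scale with the results returned,
// not with the size of the collection:
Paginate(Match(Index("users_by_phone_number"), "12345678"))

Of course, this only handles exact equality, which is exactly why substring search needs bindings and NGram on top of it.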

This is good and bad. Bad, because it’s more work and learning for us, building specific indexes and being careful with our queries. Good, because every query is deterministic – it’s always possible to estimate the cost of a query beforehand (if you also understand what your data looks like). It also means that it is harder to do something by accident. For example, the default Page size is 64, so you have to be explicit about requesting more, as in the snippet below. There is no query planner to decide to scan the entire table on your behalf.
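For example (reusing your all_users index from above):

// FQL shell: pages default to 64 entries; anything bigger is opt-in.
Paginate(Match(Index("all_users")), { size: 200 })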

I won’t rehash everything in the SO answer and other links. But I’ll try to hit the highlights.

Executive summary

  1. Setting array fields as an index term creates an index entry for each element of the array.
  2. Index bindings can compute new fields based on the document’s data.
  3. Index bindings can create array fields.
  4. Different bindings can provide different kinds of search.
  5. NGram can be used to create bindings for partial word search or fuzzy search.
  6. Don’t give up on the examples.

Array fields

from the docs:

When a document has a field containing an array of items, and that field is indexed, Fauna creates distinct index entries for each item in the field’s array. That makes it easy to search for any item in the array.
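As a sketch (the collection, index, and field names here are mine, purely for illustration):

// FQL shell. If data.tags is ["admin", "beta"], indexing the tags field
// creates one index entry per element, so either value finds the document.
CreateIndex({
  name: "users_by_tag",
  source: Collection("usersData"),
  terms: [{ field: ["data", "tags"] }]
})

Paginate(Match(Index("users_by_tag"), "beta"))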

Bindings

You can create your own computed fields and index them using “bindings”. The docs have several different examples of how they can be used (not just in full-text/fuzzy search).

An index’s source field defines one or more collections that should be indexed. It is also used to define zero or more fields that have associated binding functions. The binding functions compute the value for the specified field while the document is being indexed.

Using bindings for search

Slightly simplified from the SO answer, a binding that lets you do a full-word search (the binding lowercases each word, so pass search terms in lowercase):

Here, the FindStrRegex function is key.

// C# (using static FaunaDB.Query.Language)
CreateIndex(Obj(
  "name", "tasks_by_words",
  "source", Arr(
    Obj(
      "collection", Collection("tasks"),
      "fields", Obj(
        "words", Query(Lambda(
          "task_document",
          Map(
            // split the name field into words on spaces
            FindStrRegex(
              Select(Arr("data", "name"), Var("task_document")),
              "[^\\ ]+"
            ),
            // FindStrRegex returns objects; take the matched text and lowercase it
            Lambda("result", LowerCase(Select(Arr("data"), Var("result"))))
          )
        ))
      )
    )
  ),
  "terms", Arr(
    Obj("binding", "words")
  )
))
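Querying it is then a plain Match. In shell syntax (lowercase the input, since the binding lowercased the words):

// FQL shell: find tasks whose name contains the whole word "report".
Map(
  Paginate(Match(Index("tasks_by_words"), "report")),
  Lambda("ref", Get(Var("ref")))
)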

NGram is what you want for partial word (exact contains) search or fuzzy search

The SO answer covers using NGram for these two cases.
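I won’t reproduce the whole answer, but the shape of it is a binding that stores the n-grams of a field as index terms. A rough sketch for your email case, in shell syntax (the names and the fixed trigram size are my own choices, and NGram is undocumented, so verify the behavior before relying on it):

// FQL shell. Store every trigram of the lowercased email as an index term.
CreateIndex({
  name: "users_by_email_trigrams",
  source: [{
    collection: Collection("usersData"),
    fields: {
      trigrams: Query(Lambda(
        "doc",
        NGram(LowerCase(Select(["data", "email"], Var("doc"))), 3, 3)
      ))
    }
  }],
  terms: [{ binding: "trigrams" }]
})

// A search for "abc" is now an index lookup instead of a collection scan.
Map(
  Paginate(Match(Index("users_by_email_trigrams"), "abc")),
  Lambda("ref", Get(Var("ref")))
)

For inputs longer than three characters, Match each trigram of the input and take the Intersection of the sets, then optionally confirm with ContainsStr over the much smaller result.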

Don’t give up on the examples.

The Fwitter example is in JavaScript, but the file is almost entirely FQL. For C#, the required nested Obj and Arr functions make things tedious, but the FQL functions are otherwise directly translatable.

Thank you @ptpaterson. I will surely give it a shot.

@ptpaterson It says Obj() can’t take 3 arguments.

oops. Typos here:

(screenshot highlighting the typos)

I’ve updated my post.
