Full Text search with NGram (approach validation)

Hi there, I’m looking for help validating my approach.

I’m currently implementing full-text search via NGram. Each document in the collection has a field called description, and the goal is to search the collection based on that field.

My index:

q.CreateIndex({
  name: 'transactionsByDescription',
  terms: [
    { binding: 'search' }
  ],
  source: {
    collection: q.Collection('transactions'),
    fields: {
      search: q.Query(
        q.Lambda(
          'transaction',
          q.NGram(
            q.LowerCase(
              q.Select(['data', 'description'], q.Var('transaction'))
            ),
            3,
            3
          )
        )
      )
    }
  }
});

Now, I’m trying to test what the query should be via the Shell on the dashboard. Currently, there is only one document in the transactions collection, and its description is Testing Only. The user will provide a search string; in the example below, the search string is Testing.

The trigrams produced from the search string will be ["tes", "est", "sti", "tin", "ing"].
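
You can verify this by running the NGram call on its own in the Shell:

NGram(LowerCase('Testing'), 3, 3)
// => ['tes', 'est', 'sti', 'tin', 'ing']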

The way I do it is by splitting the search string into trigrams as well, then looping through each of them and running Match on each one:

Map(
  NGram(
    LowerCase('Testing'),
    3,
    3
  ),
  Lambda(
    'needle',
    Paginate(
      Match(
        Index('transactionsByDescription'),
        Var('needle')
      )
    )
  )
)

Since 5 trigrams were produced, this loops 5 times and results in an array of 5 pages that all repeat the same reference:

[
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  }
]

This is not the desired result, so I thought of doing:

Distinct(
  Map(
    NGram(
      LowerCase('Testing'),
      3,
      3
    ),
    Lambda(
      'needle',
      Select(
        ['data'],
        Paginate(
          Match(
            Index('transactionsByDescription'),
            Var('needle')
          )
        )
      )
    )
  )
)

This now results in the following array:

[
  [Ref(Collection("transactions"), "278191047968817665")]
]

There are no repeating items, which is the desired result. Still, I think my approach is incorrect, particularly because the Paginate is inside the Lambda. I actually want to implement pagination on this with a page size of 10, and I don’t think this approach will work with that.

The reason I’m splitting the search string into trigrams is that, if I don’t, I have to do this:

Paginate(
  Match(
    Index('transactionsByDescription'),
    'tes'
  ),
  {
    size: 10
  }
)

This works really well, and since Paginate is the top-level query, pagination is not a problem. But it only works IF the search string is exactly 3 characters long, because that satisfies the trigram. The query above returns:

{
  data: [Ref(Collection("transactions"), "278191047968817665")]
}

But if the search string is not 3 characters, like:

Paginate(
  Match(
    Index('transactionsByDescription'),
    'testing'
  ),
  {
    size: 10
  }
)

It results in an empty array:

{
  data: []
}

This is because the full search string no longer matches the trigram terms stored in the index.

I found an answer:

Paginate(
  Intersection(
    Map(
      NGram(
        LowerCase('Testing'),
        3,
        3
      ),
      Lambda(
        'needle',
        Match(
          Index('transactionsByDescription'),
          Var('needle')
        )
      )
    )
  ),
  {
    size: 10
  }
)

Which will result in:

{
  data: [Ref(Collection("transactions"), "278191047968817665")]
}

Not sure if this is the correct approach, though.

Great example @aprilmintacpineda, have you tried this type of search on collections with more than 30k documents? How is the performance? Does the storage cost go up a lot?

I have a collection with more than 50k documents, and I want to find a solution that balances cost and performance!

Great question, haven’t had the chance to do that. Maybe you can try it out?

The problem I’m encountering with this approach is that the longer the search string is, the more results you get and the less accurate they are. Ideally, you’d want the results to be really accurate; this approach doesn’t seem to do that unless you’re willing to do some more filtering on the JS side, which kinda sucks.

This is what fuzzy search does: break a big search term into smaller terms and search for them. It’s definitely going to return more results.

One thing that can be done, although it can get intensive, is to check if there are results for the intersection of multiple terms.

example:

Given the search term “big brown cow”, and a calculation that extracts the grams “big”, “brown”, and “cow” from it:

Let(
  {
    search_term: "big brown cow",
    // ... some calculations
    grams_in_search: ["big", "brown", "cow"],

    most_precise_search: Intersection(Map(
      Var("grams_in_search"),
      gram => Match(Index("my_index"), gram)
    ))
  },
  If(
    Exists(Var("most_precise_search")),
    Paginate(Var("most_precise_search")),
    null // try something else
  )
)

Of course, this could become extreme if your search term provided a large set of grams. Perhaps you can pick groups of 3 or 4 of them and test a few permutations before punting to the most general and fuzziest of searches. But this also gets out of hand quickly if you are using trigrams or other small grams (e.g. using ngram(string, 3, 3)), so maybe you only want to try this approach with whole words.

Work your way from most specific to most fuzzy

And anything with ngram is going to be on the fuzzy side, so don’t even reach for ngram until you try other things.

You might consider stacking your searches in some order of precedence, where you make one search and, if nothing comes up, check for matches in a less precise search (see the sketch below).

(the following is not intended to be all-encompassing, only an example)

  1. exact matches (for truly short string fields)
  2. intersection of whole words. All words must exist. This is an “AND” operation of whole words.
  3. intersection of some whole words, like maybe just the first 3 :person_shrugging:
  4. union of matches with each whole word. Any words must exist. This is an “OR” operation of whole words.
  5. intersection of some ngrams.
  6. union of ngrams

Each of these searches (and other ones you think of) will probably have different indexes, or maybe just use them differently (e.g. combined with Intersection vs. Union).
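
As a rough sketch of that cascade, assuming the word_search and ngram_search indexes defined at the end of this post, with the words and per-word grams of the search term hard-coded for brevity:

Let(
  {
    // whole words and per-word trigrams from the search term (hard-coded here)
    words: ["big", "brown", "cow"],
    grams: ["big", "bro", "row", "own", "cow"],
    // tier 2: every whole word must match ("AND")
    andWords: Intersection(Map(Var("words"), word => Match(Index("word_search"), word))),
    // tier 4: any whole word may match ("OR")
    orWords: Union(Map(Var("words"), word => Match(Index("word_search"), word))),
    // tier 6: fuzziest, a union of ngram matches
    orGrams: Union(Map(Var("grams"), gram => Match(Index("ngram_search"), gram)))
  },
  If(
    Exists(Var("andWords")),
    Paginate(Var("andWords"), { size: 10 }),
    If(
      Exists(Var("orWords")),
      Paginate(Var("orWords"), { size: 10 }),
      Paginate(Var("orGrams"), { size: 10 })
    )
  )
)

Since Match, Intersection, and Union only build set references, each tier should only incur reads when its Exists or Paginate actually evaluates.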

Additional advice for ngram

You want to limit the number of grams you generate and thus create index entries for. Here is a sequence of steps you can take to limit that. The same steps would be taken to break down your search term.

(I’m sure it can be improved further, and should definitely be looked at through the lens of one’s actual use case.)

  1. limit the size of the string, i.e. only search the first N characters:
truncated: SubString(Var("field"), 0, N)
  2. normalize the string:
normalized: Casefold(Var("truncated"))
  3. break the string down into whole words:
words: Map(
  FindStrRegex(Var("normalized"), "[^ ]+"),
  res => Select("data", res)
)
  4. limit the number of words to N:
firstWords: Take(N, Var("words"))

Now, at this point, you can use this for indexing whole words. You should still do all of this if you want to continue to use ngram. This is because grams like "g b" and "n c" are arguably not as useful. So only create grams for the words that appear.
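
For instance, a quick Shell check on the whole string shows those cross-word grams appearing:

NGram("big brown cow", 3, 3)
// the result includes cross-word grams such as "g b" and "n c"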

  5. create grams for each word:
wordGrams: Map(
  Var("firstWords"),
  word => NGram(word, 3, 3)
)
  6. combine distinct grams:
wordGramsCombined: Distinct(Union(Var("wordGrams")))

Now put that all together

whole word search

CreateIndex({
  name: "word_search",
  source: {
    collection: Collection("things"),
    fields: {
      search: Query(
        Lambda(
          "doc",
          Let(
            {
              field: Select(["data", "field"], Var("doc")),
              truncated: SubString(Var("field"), 0, N),
              normalized: Casefold(Var("truncated")),
              words: Map(
                FindStrRegex(Var("normalized"), "[^ ]+"),
                res => Select("data", res)
              ),
              firstWords: Take(N, Var("words")),
            },
            Var("firstWords"),
          )
        )
      ),
    },
  },
  terms: [{ binding: "search" }]
})

ngram search

CreateIndex({
  name: "ngram_search",
  source: {
    collection: Collection("things"),
    fields: {
      search: Query(
        Lambda(
          "doc",
          Let(
            {
              field: Select(["data", "field"], Var("doc")),
              truncated: SubString(Var("field"), 0, N),
              normalized: Casefold(Var("truncated")),
              words: Map(
                FindStrRegex(Var("normalized"), "[^ ]+"),
                res => Select("data", res)
              ),
              firstWords: Take(N, Var("words")),
              wordGrams: Map(
                Var("firstWords"), 
                word => NGram(word, 3, 3)
              ),
              wordGramsCombined: Distinct(Union(Var("wordGrams")))
            },
            Var("wordGramsCombined"),
          )
        )
      ),
    },
  },
  terms: [{ binding: "search" }]
})
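
To query with these, you would run the search term through the same steps and combine the matches. For example, a sketch of a paginated ngram search with a page size of 10 (term is an example value, and N is a placeholder as above; Intersection requires every gram to match, and you could swap in Union for a fuzzier "OR"):

Let(
  {
    term: "big brown cow",
    // same breakdown steps as the index binding above
    truncated: SubString(Var("term"), 0, N),
    normalized: Casefold(Var("truncated")),
    words: Map(
      FindStrRegex(Var("normalized"), "[^ ]+"),
      res => Select("data", res)
    ),
    firstWords: Take(N, Var("words")),
    wordGrams: Map(
      Var("firstWords"),
      word => NGram(word, 3, 3)
    ),
    wordGramsCombined: Distinct(Union(Var("wordGrams")))
  },
  Paginate(
    Intersection(Map(
      Var("wordGramsCombined"),
      gram => Match(Index("ngram_search"), gram)
    )),
    { size: 10 }
  )
)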