Full Text search with NGram (approach validation)

Hi there, I’m looking for help validating my approach.

I’m currently implementing full-text search via NGram. Each document in the collection has a field called description, and the goal is to search the collection by that field.

My index:

q.CreateIndex({
  name: 'transactionsByDescription',
  terms: [
    { binding: 'search' }
  ],
  source: {
    collection: q.Collection('transactions'),
    fields: {
      search: q.Query(
        q.Lambda(
          'transaction',
          q.NGram(
            q.LowerCase(
              q.Select(['data', 'description'], q.Var('transaction'))
            ),
            3,
            3
          )
        )
      )
    }
  }
});

Now I’m trying to test the query in the Shell on the dashboard. Currently, the transactions collection contains only one document, whose description is Testing Only. The user will provide a search string; in the example below, the search string is Testing.

The NGrams produced from the search string will be ["tes", "est", "sti", "tin", "ing"]
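For reference, here is a minimal plain-JavaScript sketch of how those trigrams can be derived. It only emulates a 3-character sliding window over the lowercased string; `trigrams` is a hypothetical helper, not Fauna's NGram function, whose exact behaviour should be confirmed against the docs.

```javascript
// Emulates 3-character n-grams with a sliding window.
// Hypothetical helper for illustration; this is not Fauna's NGram.
function trigrams(text) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + 3 <= s.length; i++) {
    grams.push(s.slice(i, i + 3));
  }
  return grams;
}

console.log(trigrams('Testing')); // [ 'tes', 'est', 'sti', 'tin', 'ing' ]
```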

My approach is to split the search string into NGrams of 3 as well, then loop through the resulting items and run Match for each one.

Map(
  NGram(
    LowerCase('Testing'),
    3,
    3
  ),
  Lambda(
    'needle',
    Paginate(
      Match(
        Index('transactionsByDescription'),
        Var('needle')
      )
    )
  )
)

Since 5 NGrams are produced, this loops 5 times and returns an array of 5 pages, each containing the same reference:

[
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  },
  {
    data: [Ref(Collection("transactions"), "278191047968817665")]
  }
]

This is not the desired result, so I thought of doing this:

Distinct(
  Map(
    NGram(
      LowerCase('Testing'),
      3,
      3
    ),
    Lambda(
      'needle',
      Select(
        ['data'],
        Paginate(
          Match(
            Index('transactionsByDescription'),
            Var('needle')
          )
        )
      )
    )
  )
)

This now results in the following array:

[
  [Ref(Collection("transactions"), "278191047968817665")]
]

There are no repeating items, which is the desired result. Still, I think my approach is incorrect, particularly because Paginate is inside the Lambda. I want to paginate the overall result with a size of 10, and I don’t think this approach will work with that.

The reason I’m splitting the search string into trigrams is that if I don’t, I will have to do this:

Paginate(
  Match(
    Index('transactionsByDescription'),
    'tes'
  ),
  {
    size: 10
  }
)

This works really well, and Paginate is the outermost call, so pagination is not a problem. However, it only works if the search string is always exactly 3 characters long, because only then is it a valid trigram.

{
  data: [Ref(Collection("transactions"), "278191047968817665")]
}

But if the search string is not exactly 3 characters long, like:

Paginate(
  Match(
    Index('transactionsByDescription'),
    'testing'
  ),
  {
    size: 10
  }
)

It results in an empty array:

{
  data: []
}

This is because the index terms are all trigrams, and the full search string is not one of them.
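A minimal in-memory sketch of why the longer string finds nothing, assuming the index holds only the 3-character terms of the lowercased description (including ones that contain a space). `trigrams`, `buildIndex`, and `match` are hypothetical plain-JavaScript stand-ins, not Fauna functions.

```javascript
// Hypothetical helper: 3-character sliding window over the lowercased text.
function trigrams(text) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + 3 <= s.length; i++) {
    grams.push(s.slice(i, i + 3));
  }
  return grams;
}

// Hypothetical stand-in for the index: trigram -> Set of document ids.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const gram of trigrams(doc.description)) {
      if (!index.has(gram)) index.set(gram, new Set());
      index.get(gram).add(doc.id);
    }
  }
  return index;
}

// Exact-term lookup, loosely analogous to Match on a single term.
function match(index, term) {
  return [...(index.get(term) || [])];
}

const index = buildIndex([
  { id: '278191047968817665', description: 'Testing Only' },
]);

console.log(match(index, 'tes'));     // [ '278191047968817665' ]
console.log(match(index, 'testing')); // []
```

The lookup is an exact match on a stored term, so only strings that are themselves trigrams of the description can ever hit.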

I found an answer:

Paginate(
  Intersection(
    Map(
      NGram(
        LowerCase('Testing'),
        3,
        3
      ),
      Lambda(
        'needle',
        Match(
          Index('transactionsByDescription'),
          Var('needle')
        )
      )
    )
  ),
  {
    size: 10
  }
)

Which will result in:

{
  data: [Ref(Collection("transactions"), "278191047968817665")]
}

I’m not sure whether this is the correct approach, though.
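To reason about what intersecting the per-trigram matches does, here is a minimal in-memory sketch. It assumes the intersection keeps only documents that appear under every trigram of the search string. The second description ('Latest invoice') is made up for illustration, and `trigrams`, `buildIndex`, and `search` are hypothetical plain-JavaScript helpers rather than Fauna functions.

```javascript
// Hypothetical helper: 3-character sliding window over the lowercased text.
function trigrams(text) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + 3 <= s.length; i++) {
    grams.push(s.slice(i, i + 3));
  }
  return grams;
}

// Hypothetical stand-in for the index: trigram -> Set of document ids.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const gram of trigrams(doc.description)) {
      if (!index.has(gram)) index.set(gram, new Set());
      index.get(gram).add(doc.id);
    }
  }
  return index;
}

// Keep only ids that appear under every trigram of the search string.
function search(index, needle) {
  const grams = trigrams(needle);
  if (grams.length === 0) return []; // needle shorter than 3 characters
  const sets = grams.map((gram) => index.get(gram) || new Set());
  return [...sets[0]].filter((id) => sets.every((set) => set.has(id)));
}

const index = buildIndex([
  { id: 'doc1', description: 'Testing Only' },
  { id: 'doc2', description: 'Latest invoice' }, // made-up second document
]);

console.log(search(index, 'Testing')); // [ 'doc1' ]
console.log(search(index, 'test'));    // [ 'doc1', 'doc2' ]
```

In this emulation each document appears at most once, and a search string shorter than 3 characters produces no trigrams and therefore no results. How Fauna's NGram and Intersection handle that case is worth checking separately.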