How to build a search index with FQL?

TL;DR: what is the best way to implement search functionality in FQL?

I’ve been building a site with a search feature. It lets you search for emojis and for categories of emojis. For example, the category “horror” might contain the emojis :ghost::clown_face::zombie:, and the user can find that category by searching “horror”, “spooky”, “scary”, etc. The search bar is live, so a partial string like “spoo” should match too.

I’m looking for the simplest solution to achieve something like this. I actually had a working version of it, using the undocumented Ngram function for FQL, but something has changed and that no longer works.

A search like this seems like something that should be simple to implement. After all, it’s very common for a site to have a search function, and partial-string matching is something users have come to expect, e.g. every time you type into an address bar.

What I’ve tried so far…

This all worked fine until recently. I lifted some code from the Fwitter Tutorial and altered it so that each category could be found by multiple keywords. I used the Ngram approach and this code to create the index:

const createEmojiSetNgramIndex = CreateIndex({
    name: 'emojiset_by_exact_ngrams_full',
    source: [
        {
            collection: [Collection('EmojiSet')],
            fields: {
                wordparts: Query(Lambda('match', Map(Select(['data', 'queries'], Var('match')), Lambda('query', GenerateNgrams('query', Var('match'))))))
            }
        }
    ],
    terms: [
        {
            binding: 'wordparts'
        }
    ]
})

That worked fine for a while, but now it no longer does, even though I haven’t made any changes to the code. I’ve tried building an identical index: the index builds, but it returns no results.
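By “returns no results” I mean that a query along these lines comes back with an empty page (simplified):

Paginate(Match(Index('emojiset_by_exact_ngrams_full'), 'spoo'))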

A prod in the right direction would be really appreciated. Thanks.

Hi @ShadowfaxRodeo,

What do your documents look like?

Luigi

Here’s the document for one of the emoji categories. The queries array is what I’d like to search.

{
  "ref": Ref(Collection("EmojiSet"), "266973888963936774"),
  "ts": 1602448054517000,
  "data": {
    "name": "equines",
    "queries": [
      "horse",
      "horses",
      "equines",
      "equine"
    ],
    "emojis": [
      Ref(Collection("Emoji"), "266973412635708935"),
      Ref(Collection("Emoji"), "266973415107201542"),
      Ref(Collection("Emoji"), "266973415378780678"),
      Ref(Collection("Emoji"), "266973416143202821"),
      Ref(Collection("Emoji"), "266973416698939911"),
      Ref(Collection("Emoji"), "266973417648951814")
    ],
    "note": "",
    "sort": "",
    "published": true
  }
}

Hi @ShadowfaxRodeo,

Efficiency will very much depend on how many tags you have, but you can try a function like this:

CreateFunction(
  {
    name: 'getEmojis',
    role: null,
    body:
      Query(
        Lambda(['term'],
          Let(
            {
              partialTerm: Var('term'),
              // scan every distinct query term and keep the ones containing the partial term
              matchingTerms: Select(
                ['data'],
                Paginate(
                  Filter(
                    Distinct(Match(Index('allQueries'))),
                    Lambda('x', ContainsStr(Var('x'), Var('partialTerm')))
                  )
                )
              )
            },
            // for each matching term, collect the documents indexed under it
            Reduce(
              Lambda(
                ['acc', 'value'],
                Append(Var('acc'), Select(['data'], Paginate(Match(Index('refByQuery'), Var('value')))))
              ),
              [],
              Var('matchingTerms')
            )
          )
        )
      )
  }
)

It uses 2 indexes as below:

{ serialized: true,
  name: 'allQueries',
  source: Collection("EmojiSet"),
  values: [ { field: [ 'data', 'queries' ] } ],
  partitions: 8 }

and

{ serialized: true,
  name: 'refByQuery',
  source: Collection("EmojiSet"),
  terms: [ { field: [ 'data', 'queries' ] } ],
  partitions: 1 }
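In case you prefer to create them from the driver, the equivalent CreateIndex calls should look roughly like this (a sketch; I left out serialized and partitions since those are just the defaults Fauna reports back):

CreateIndex({
  name: 'allQueries',
  source: Collection('EmojiSet'),
  // emits one index entry per string in each document's queries array
  values: [{ field: ['data', 'queries'] }]
})

CreateIndex({
  name: 'refByQuery',
  source: Collection('EmojiSet'),
  // lets you look documents up by any single query term
  terms: [{ field: ['data', 'queries'] }]
})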

The function first scans the index containing all the query terms looking for partial matches, then takes the matching terms and looks them up in the second index to find the matching documents.

You can give it a try and see how it works.
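For example, searching for the partial term “spoo” would then be a single call (assuming the function above has been created):

Call(Function('getEmojis'), 'spoo')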

Luigi

Can you show us your NGram code? Nothing has changed in the function as far as I know, so it should still be working.

Thanks Luigi!
I’ll give it a go now.

Sure thing, that was an oversight; it should have been in the original post.
Here’s the GenerateNgrams function. I’m almost certain it is lifted straight from the Fwitter Tutorial GitHub page.

function GenerateNgrams (Phrase) {
    return Distinct(
        Union(
            Let(
                {
                    // Reduce this array if you want less ngrams per word.
                    indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                    indexesFiltered: Filter(
                        Var('indexes'),
                        // filter out the ones below 0
                        Lambda('l', GT(Var('l'), 0))
                    ),
                    ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l'))))
                },
                Var('ngramsArray')
            )
        )
    )
}

Definitely from the Fwitter tutorial. As far as I can see this still works on my side, so I’m assuming there might be something else wrong when you say the index builds but returns no results.

In that case something else must be wrong, but I have no idea what it is. My data hasn’t changed shape, and the indexes are being built; they just don’t return anything.

This part of your binding makes little sense to me, though. The way you use ‘query’ as a variable is wrong: you are passing the plain string ‘query’ to GenerateNgrams instead of Var(‘query’). The function also takes only one argument, yet you are calling it with two, so I don’t think that can work.

Thanks for all your help. My first thought when you pointed that out was to facepalm, but I fixed it to

Lambda('query', GenerateNgrams(Var('query')))

and it’s still producing the same behaviour, creating an index but not returning any results.

I must have made more mistakes somewhere, which leaves the mystery of how I ever managed to get it working in the first place.

So I went back to the Fwitter tutorial and lifted the code out again. This is the function it uses for generating the ngrams:

function WordPartGenerator (WordVar) {
    return Let(
        {
            indexes: q.Map(
                // Reduce this array if you want less ngrams per word.
                // Setting it to [ 0 ] would only create the word itself, Setting it to [0, 1] would result in the word itself
                // and all ngrams that are one character shorter, etc..
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                Lambda('index', Subtract(Length(WordVar), Var('index')))
            ),
            indexesFiltered: Filter(
                Var('indexes'),
                // filter out the ones below 0
                Lambda('l', GT(Var('l'), 0))
            ),
            ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(WordVar), Var('l'), Var('l'))))
        },
        Var('ngramsArray')
    )
}

This is the function I’m using to build the index:

const createEmojiSetNgramIndex = CreateIndex({
    name: 'test_index',
    // we actually want to sort to get the shortest word that matches first
    source: [
        {
            // If your collections have the same property that you want to access you can pass a list to the collection
            collection: [Collection('EmojiSet')],
            fields: {
                wordparts: Query(
                    Lambda('emojiset',
                        Map(Select(['data', 'queries'], Var('emojiset')), Lambda('query', WordPartGenerator(Var('query'))))
                    )
                )
            }
        }
    ],
    terms: [
        {
            binding: 'wordparts'
        }
    ]
})

The only difference between my function and the Fwitter one is that mine maps over an array of strings instead of a single string. But I’m still getting the same behaviour.

Okay, so the issue was not only what you pointed out, but also that the terms were being returned in nested arrays. Wrapping both the mapped queries and the ngram-generating function in Union() to flatten the arrays was the final hurdle. I also added in Distinct().
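To make the nesting concrete, tracing WordPartGenerator through for a single word like “horse” should give an array of arrays, one per ngram length:

[
  ["horse"],
  ["hors", "orse"],
  ["hor", "ors", "rse"],
  ["ho", "or", "rs", "se"],
  ["h", "o", "r", "s", "e"]
]

and mapping that over every string in queries nests it one level deeper, which is why the binding needs Union() at both levels (plus Distinct() to drop duplicate ngrams). Here’s the index definition that finally works: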

const createEmojiSetNgramIndex = CreateIndex({
    name: 'test_index',
    // we actually want to sort to get the shortest word that matches first
    source: [
        {
            // If your collections have the same property that you want to access you can pass a list to the collection
            collection: [Collection('EmojiSet')],
            fields: {
                wordparts: Query(
                    Lambda('emojiset',
                        Distinct(
                            Union(
                                Map(
                                    Select(['data', 'queries'], Var('emojiset')),
                                    Lambda('query', Distinct(Union(WordPartGenerator(Var('query')))))
                                )
                            )
                        )
                    )
                )
            }
        }
    ],
    terms: [
        {
            binding: 'wordparts'
        }
    ]
})
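And for completeness, querying the finished index is then just a Match on the (lowercased) partial search string, something along these lines:

Map(
    Paginate(Match(Index('test_index'), 'spoo')),
    Lambda('ref', Get(Var('ref')))
)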

Thanks for all your help.