How to sort and rank fuzzy logic Ngrams search by most relevant result

I recently changed my search functionality to use trigrams with NGrams. The only issue is, I’m getting back results that contain one part of my input string before I get the relevant result. I am not searching one word at a time, I am trying to search a list of FAQs with questions as the input string. Sometimes I may enter the exact question as my input query that my index but still I wont get the desired result until much later in the resulting array because other questions in the database have similar keywords.

Here is my FQL function. (The _ params are only there because its required by GraphQL on paginated functions.)

Query(
        Lambda(
            ["search_input", "_", "__", "___",],
            Map(
                Paginate(
                    Union(
                        Map(
                            NGram(Var("search_input"), 3, 3), 
                            Lambda(
                                'nGramInput', 
                                Match(Index('search_qa_fuzzy'), Var('nGramInput'))
                            )
                        )
                    )
                ),
                Lambda('ref', Get(Var('ref')))
              )
        )
    )

Here is my index

CreateIndex({
        name: 'search_qa_fuzzy',
        source: {
          collection: Collection('freq_asked_questions'),
          fields: {
            ngrams: Query(
                Lambda(
                    'questionDoc', 
                    Distinct(
                        NGram(
                            LowerCase(
                                Select(['data', 'question'], Var('questionDoc'))
                            ), 
                            3, 
                            3,
                        )
                    )
                )
            )
          }
        },
        terms: [
          {
            binding: 'ngrams'
          }
        ]
      })

For example purposes, say the 5 questions I have in my database are:

“How big is the Earth?”
“How many people live on Earth?”
“What planets are further than Earth to the sun?”
“Is the Earth over 2 billion years old”
“How long ago did dinosaurs live on Earth?”

Currently if I enter this question as my search term: “How long ago did dinosaurs live on Earth?”, it will show up last in my results, despite being an exact match to one of my questions. The other questions will populate as they also contain the word “Earth”.
The only thing I can think of would be to count how many Ngrams in my input query have the most matches to the final result and sort the results by the highest matching Ngrams count. (Essentially creating a ranking system) I would believe this would make my most relevant results appear first as the most matches for the amount of Ngrams should more resemble the question than other results.

i.e. all other results may have a count of 1 because only “Earth” matches while the exact question would have a count of 7 because all words match (Of course these numbers are not accurate because it would not be the words themselves matching but instead substrings of 3 characters each for each Ngram of the words, but you get the point)

I just have no idea how to do this in FQL. Any ideas?

1 Like

Okay so I worked on this all day and got decently far.

Right now I am able to create a rank on a given search result based on how many times it has appeared in the results for other Ngram searches.

Let(
  {
    gramSetResults: Map(
      NGram("Earth ", 3, 3),
      Lambda(
        "nGramInput", 
        Match(Index("search_qa_fuzzy"), Var("nGramInput"))
      )
    ),
    allGramResults: Map(
      Var("gramSetResults"),
      Lambda(
         "gramSet", 
          Select("data", Paginate(Var("gramSet")))
      )
    ),
    uniqueGramRefResults: Select("data", Paginate(Union(Var("gramSetResults")))),
    rankResults: Map(
      Var("uniqueGramRefResults"),
      Lambda("uniqueResultRef", {
        gramRef: Var("uniqueResultRef"),
        rank: Reduce(
          Lambda(
            ["count", "gramRefArr"],
            If(
              ContainsValue(Var("uniqueResultRef"), Var("gramRefArr")),
              Add(Var("count"), 1),
              Add(Var("count"), 0)
            )
          ),
          0,
          Var("allGramResults")
        )
      })
    )
  },
  Var("rankResults")
)

gramSetResults returns a list of Sets for each Ngram by the searched Index.
It would look like this if the search term was “Earth”

[
  {
    "@set": {
      match: Index("search_qa_fuzzy"),
      terms: "Ear"
    }
  },
  {
    "@set": {
      match: Index("search_qa_fuzzy"),
      terms: "art"
    }
  },
  {
    "@set": {
      match: Index("search_qa_fuzzy"),
      terms: "th "
    }
  }
]

Each one of those sets has a list of Refs stored in it that are not unique. To get all of the non-unique refs, allGramResults converts the array of sets into an array of arrays with nonUnique refs in each array.

To get the unique Refs, Union handles this in uniqueGramRefResults

Then it is a matter of ranking the unique Refs based on how often they appear in each of the non-unique arrays correlating to each set return by each Ngram. For this I used Reduce to go through the array of arrays of non-unique Refs and created a counter to increment every time the unique Ref was found in one of those sets. The higher the rank, the closer the result is the input question query. And after testing the validity of the ranking, so far this works quite well as a ranking algorithm. The problem now is sorting the final list.

rankResults returns
a list of refs with a rank attached to the resulting object


[
  {  
      gramRef: Ref(Collection("freq_asked_questions"), "306041453769958979"),
      rank: 10
    },
    {
      gramRef: Ref(Collection("freq_asked_questions"), "306041453769960003"),
      rank: 18
    },
    {
      gramRef: Ref(Collection("freq_asked_questions"), "306041453769961027"),
      rank: 2
    },
    {
      gramRef: Ref(Collection("freq_asked_questions"), "306041453769962051"),
      rank: 12
    }, 

]

Unfortunately, this is where I’m stuck. I dont know how to sort by value in FQL. Im hoping this is significantly easier for someone to solve than the rest of the work I did to get the rank haha. If anyone has any idea please feel free to suggest something.

Are you okay with sending that data to the client and letting the client do the sorting?

Doing the aggregation outside of the Paginate function already means you can’t really use paginated results; You have to get one, sufficiently large enough page to aggregate.

There’s no network performance benefit sorting, and client-side javascript is essentially free compute.

I am certain one can implement a Quick-Sort-By algorithm in FQL, but that could add a lot of complexity that you then have to manage. As fun as it sounds to do the FQL implementation (because I’m weird like that :nerd_face: ) I’d hard-pass in this case, in favor of

[...results].sort((a, b) => b.rank - a.rank)

// done! :D

I should also say that all this work is awesome! Thank you for sharing with the community :smiley:

Im with you that the JS version of sort would be much easier to do in a serverless function or something. However, as you mentioned, this is not ideal because then pagination goes out the window, and pagination was a main focal point of having the search query in the first place. I tried looking into how to do a sort function with FQL. After going through a ton of different sort algorithms, I actually don’t believe this is possible in FQL at this time. I will make a feature request for it but it seems as if this is currently unsolvable with FQL.

I submitted a feature request for a native sort. Eager to see if anyone on the fauna team can come up with a sort function with the native FQL functions. I still do not believe this is possible currently.

To be clear, you are using Reduce on the results to aggregate on the trigrams. That is why you cannot Paginate. The sorting is just something extra in this case.

That doesn’t make anything any easier, though :confused:

Okay. I thought that the pagination had to do with how the results came back conceptually. If the highest ranked search came back at the bottom of a list of 20 documents and I was paginating to only show 10 at a time, the most relevant search wouldn’t show up in that first batch of paginated results which is not ideal at all. I assumed pagination would be useless because it is not guaranteed that the most relevant document will be included in the page you’re paginating against unless it is sorted.

Also Im only using Reduce to get the rank value. Not to produce the final array value. Wouldn’t I be able to paginate results by just calling Paginate over the final array that is returned?

Here is the final query if anyone wants to try out the ranked NGrams search. I removed the “_” parameters as I got rid of the (paginated: true) in my schema.

Query(
  Lambda(
    ["search_input"],
    Let(
      {
        gramSetResults: Map(
          NGram(Var("search_input"), 3, 3),
          Lambda(
            "nGramInput",
            Match(Index("search_qa_fuzzy"), Var("nGramInput"))
          )
        ),
        allGramResults: Map(
          Var("gramSetResults"),
          Lambda("gramSet", Select("data", Paginate(Var("gramSet"))))
        ),
        uniqueGramRefResults: Select(
          "data",
          Paginate(Union(Var("gramSetResults")))
        ),
        rankResults: Map(
          Var("uniqueGramRefResults"),
          Lambda("uniqueResultRef", {
            resultRef: Var("uniqueResultRef"),
            rank: Reduce(
              Lambda(
                ["count", "gramRefArr"],
                If(
                  ContainsValue(Var("uniqueResultRef"), Var("gramRefArr")),
                  Add(Var("count"), 1),
                  Add(Var("count"), 0)
                )
              ),
              0,
              Var("allGramResults")
            )
          })
        )
      },
      Var("rankResults")
    )
  )
)

Index remains the same as before

CreateIndex({
        name: 'search_qa_fuzzy',
        source: {
          collection: Collection('freq_asked_questions'),
          fields: {
            ngrams: Query(
                Lambda(
                    'questionDoc', 
                    Distinct(
                        NGram(
                            LowerCase(
                                Select(['data', 'question'], Var('questionDoc'))
                            ), 
                            3, 
                            3,
                        )
                    )
                )
            )
          }
        },
        terms: [
          {
            binding: 'ngrams'
          }
        ]
      })

And here is a sample version of my schema

type FAQ @collection(name: "freq_asked_questions") {
  question: String!
  answer: String!
}

type SearchOutput @embedded {
  resultRef: FAQ!
  rank: Int!
}

type Query {
  searchQA(question: String!): [SearchOutput]! @resolver
}

Right. If you want your method to be successful your initial page needs to be as big as possible/reasonable.

Consider that Count is an O(n) operation (It scans the whole Set in order to work). A GroupBy function would likely be so too. What you’re doing is doing a full table scan intentionally to group all index entries by their Ref.

Unfortunately, paginating is limited to 100000 documents.

Other things to try

  • Check for exact match first: Intersection(Var("gramSetResults"))
  • Do exact match for whole words first, make those more “relevant”
  • Add tags or labels fields to you documents. Match whole words on those also before falling back on fuzzy search.
  • Add a value to your index to sort by some inherently “trendy” value. Not every application is twitter, though :grimacing:
1 Like

Consider that Count is an O(n) operation (It scans the whole Set in order to work). A GroupBy function would likely be so too. What you’re doing is doing a full table scan intentionally to group all index entries by their Ref.

Can you elaborate on this? Where would you use Count to simplify? Also I do not see GroupBy in the docs.

So I was wrong. I was very very wrong. It actually IS possible to make a sorting algorithm using FQL functions. It’s just a nightmare. I’m not even going to pretend like I came up with this myself. Fortunately some hero out there made a sorting helper function using FQL and other helper functions they also made. I deconstructed the source code for all of the nested functions in this sort method and customized it for this query to sort by descending rank. Check out the repo/package if you want the original source code for the sort method developed by shiftx.

Here is the full completed query with sorting added.

Query(
        Lambda(
          ["search_input"],
          Let(
            {
              gramSetResults: Map(
                NGram(Var("search_input"), 3, 3),
                Lambda(
                  "nGramInput",
                  Match(Index("search_qa_fuzzy"), Var("nGramInput"))
                )
              ),
              allGramResults: Map(
                Var("gramSetResults"),
                Lambda("gramSet", Select("data", Paginate(Var("gramSet"))))
              ),
              uniqueGramRefResults: Select(
                "data",
                Paginate(Union(Var("gramSetResults")))
              ),
              rankResults: Map(
                Var("uniqueGramRefResults"),
                Lambda("uniqueResultRef", {
                  resultRef: Var("uniqueResultRef"),
                  rank: Reduce(
                    Lambda(
                      ["count", "gramRefArr"],
                      If(
                        ContainsValue(Var("uniqueResultRef"), Var("gramRefArr")),
                        Add(Var("count"), 1),
                        Add(Var("count"), 0)
                      )
                    ),
                    0,
                    Var("allGramResults")
                  )
                })
              ),
              sortedResults: Reduce(
                Lambda(
                  ["result", "refRank"],
                  Let(
                    {
                      length: Count(Var("result")),
                      index: If(
                        Equals(0, Var("length")),
                        0,
                        
                        Select(
                          1,
                          Select(
                            0,
                            Filter(
                              Reduce(
                                Lambda(
                                  ["acc", "val"],
                                  Append([[Var("val"), Count(Var("acc"))]], Var("acc"))
                                ),
                                [],
                                Var("result")
                              ),
                              Lambda(["nextRefRank", "index"], GT(Select("rank", Var("refRank")), Select("rank", Var("nextRefRank")))),
                            ),
                            null
                          ),
                          Var("length")
                        )
                      )
                    },
                    Let(
                      {
                        arr: Var("result"),
                        index: Var("index"),
                        item: Var("refRank")
                      },
                      If(
                        Equals(-1, Var("index")),
                        Append([Var("item")], Var("arr")),
                        Let(
                          {
                            start: Take(Var("index"), Var("arr")),
                            end: Drop(Var("index"), Var("arr"))
                          },
                          Prepend(Prepend(Var("start"), [Var("item")]), Var("end"))
                        )
                      )
                    )
                  )
                ),
                [],
                Var("rankResults")
              )
            },
          
            Var("sortedResults")
          )
        )
      )

Im not even going to pretend like I understand everything happening here :upside_down_face: All I know is my query works perfectly and now I’m happy. But hey Fauna team, please make a native sort function. I’ll post a generic version in that thread. FQL being too verbose sometimes is an understatement lol.

If anyone wants to switch the rank sorting from descending to ascending (no idea why you’d do that for this use case), change

GT(Select("rank", Var("refRank")), Select("rank", Var("nextRefRank")))

to

GT(Select("rank", Var("nextRefRank")), Select("rank", Var("refRank")))

Outside of that, I’m not really going to be able to answer how the rest of the algorithm works but it does. :upside_down_face: :upside_down_face: Hope someone can use this for a nice search algorithm! I believe the only value anyone would need to change is the index at the beginning.

1 Like

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.