Searching a collection through a pivot collection on a many-to-many relationship

aprilmintacpineda · November 3, 2022, 11:55pm

Hello world!

So I have these two entities users and organizations and they have a many-to-many relationship, i.e., one user maybe a part of many organizations; one organization may have many users.

Now what I want to implement is searching users by name in an organization. So like if there’s an organization called FaunaDB, and it has 23 users, I want to search all users that is a part of that organization by their name, however, the way that search works in FaunaDB (afaik) is by implementing an NGram using index bindings, and bindings has to be a pure function. So the index would look like:

CreateIndex({
  name: 'searchOrganizationUsersByName',
  source: {
    collection: Collection('organizationUsers'),
    fields: {
      name: Query(
        Lambda(
          'doc',
          Distinct(
            Union(
              NGram(
                LowerCase(
                  Select(['data', 'firstName'], Var('doc'))
                ),
                2,
                2
              ),
              NGram(
                LowerCase(
                  Select(['data', 'middleName'], Var('doc'))
                ),
                2,
                2
              ),
              NGram(
                LowerCase(
                  Select(['data', 'lastName'], Var('doc'))
                ),
                2,
                2
              )
            )
          )
        )
      )
    }
  },
  terms: [
    { binding: 'organization' },
    { binding: 'name' }
  ],
  values: [
    { field: ['ref'] },
    { field: ['data', 'user'] }
  ]
});

So you’d use this as:

Map(
  Paginate(
    Match(
      Index('searchOrganizationUsersByName'),
      Ref(
        Collection('organizations'),
        '123123132'
      ),
      <user input here>
    )
  ),
  Lambda(
    ['ref', 'user']
    Get(Var('user'))
  )
)

However, as you can see, this index is targeting the pivot collection organizationUsers which does not contain the users’ firstName, middleName and lastName. So my question is for more advanced FQL users out here, how should I implement this feature? I’m not sure if I should do Join, frankly I haven’t used it outside of sorting a result from another index so I don’t fully understand how that one works yet. The idea I have (which I doubt if correct) is to store these fields in the pivot collection as well, though that will result in a duplicate which I’m not comfortable with.

ptpaterson · November 4, 2022, 3:57pm

Hi April!

As you discovered, you cannot fetch additional data in an index binding, that’s the “pure” function requirement. So, you cannot do this search with a single Index.

I am thinking of three general techniques you can try.

Duplicate data

Okay, so you could do this with a single Index if you duplicate some data from the user into the organizationsUser Collection.

For example, in your case every time you update an organizationUsers document, you can copy the user info to it. Although now you also need to make sure that every time you update the fields in the user, you update all of its related documents.

Intersection with multiple Indexes

You could also take the Intersection of users_by_organization and users_by_name (with the index binding with ngram).

This could be an expensive query if the total number of all users is large. This might provide the most simple FQL query to write, but it might not actually perform well if the number of user matches is large.

Just Filter instead of using Indexes

Otherwise, the easiest thing may just be to get the list of users for an organization and Filter them. That is, don’t use an index to find those users with the particular name. Filtering a result set of dozens, or even a few hundred documents (I’m assuming it is rare to have an organization with thousands of users in your app)

CreateIndex({
  name: "organizationUsers_by_organization__user_asc",
  terms: [
    { field: ["data", "organization"] }
  ],
  values: [
    { field: ["data", "user"] },
    { field: ["ref"] }
  ],
})

Let(
  {
    users: Map(
      Paginate(Match(Index("organizationUsers_by_organization__user_asc"), ORGANIZATION),
      Lambda(["user_ref", "ref"], Get(Var("user_ref"))
    ),
  },
  Filter(
    Var("users"),
    Lambda(
      "user",
      /* logic to determine match */
    )
  )
)

If you want to search with Fuzzy logic, then you can use ngram as part of the filter logic. What I would typically recommend is falling back on fuzzy logic after you check for something more exact (and cheaper to run) and want more results. For example,

Check for exact match (could be a simple Index)
ONLY If more results are desired, check for partial match, e.g. with ContainsStr.
ONLY If more results are desired, check fuzzy search.

Or some combination that you find is efficient for your data.

Polecat7 · November 9, 2022, 9:25am

Hi @aprilmintacpineda

@ptpaterson has already written a really good reply.

I did think it worth mentioning that there is another approach that might be worth pursuing.

Add an array field called organizationList to users.
When a user is added to an organisation, as well as adding an entry to the organizationUsers collection, you would update the organizationList array in the user’s document with a reference to the organisations collection.
The searchOrganizationUsersByName index would be altered to point to the users collection instead of the organizationUsers collection.
You would add the organizationList field from users as a search term
Because organizationList is an array in users, each entry in that array gets its own index entry in the searchOrganizationUsersByName index, making filter simple and fast.
I presume that the terms would be better written

terms: [
    { binding: 'name' },
    { field: ["data", "organizationList"]}
  ],

as name will reduce the matches much faster than searching by organisation first.

Since most users will not belong to too many organisations (I presume), this won’t cause issue with the users collection, or the indexes. It should be very fast, and less compute intensive (so cheaper to run as well).
While it does mean you have to add to or remove from the organizationList field in users when they are added or removed from an organisation, a UDF could be used to encapsulate that behaviour and keep it simple.

Hope that helps.

ptpaterson · November 9, 2022, 5:48pm

Agreed, that is a clever option! It looks much more straight forward then my own “duplicate data” example. I definitely don’t want to suggest that the ideas I had were definitive. All other ideas and discussion is most encouraged!!

@Polecat7 Is that a pattern that you are using or have used yourself? I am curious how you or others manage duplicated/denormalized data. Fauna’s really big on “Document Relational” (e.g. normalizing your data like an RDBMS), but clearly some optimizations can be made if you can manage the business logic to keep everything consistent.

aprilmintacpineda · November 9, 2022, 11:04pm

@Polecat7 @ptpaterson thank you both for the reply, that was actually very insightful!

On using `organizationList` field

Yeah I presume that users won’t be a part of too many organizations, but one consideration is that afaik (correct me if I’m wrong @ptpaterson) the query:

Match(
  Index('searchOrganizationUsersByName'),
  Ref(
    Collection('organizations'),
    '123123132'
  ),
  <user input here>
)

will run on O(n) and the users collection can have anywhere from a few thousands to a few millions of records, so searching an index that references all users will be expensive, unlike searching an index that references users by organizations, because organizations will only likely have a few tenths to a few thousands (for extremely big companies like meta, twitter, etc) of users.

I think this approach is identical to Intersection with multiple Indexes approach, just in a different flavor.

Source: Database performance - Fauna Documentation

On filter approach

I imagine this can be compute intensive as I need to do NGram on the fly.

Thanks for the insights guys, I really appreciate it! Interesting ideas!

ptpaterson · November 10, 2022, 12:46am

Paginate(Match(Index(...), ...)) does not perform a table scan, the Match Set points directly to the data. So, finding your data with an Index is done in constant time O(1), though of course more results mean more data to grab and send over the wire.

Executing additional Set operations like Intersection, Count, etc. have to take into account the number of results. For example, the complexity of the following Intersection depends on the product of the results for each of the input Sets.

Paginate(
  Intersection(            // O(n1 * n2 * n3)
    Match(Index(...), ...),  // O(1), returns n1 results
    Match(Index(...), ...),  // O(1), returns n2 results
    Match(Index(...), ...)   // O(1), returns n3 results
  )
)

Possibly. It depends on how well the index narrows the results down. Results for indexing user names with ngram does sound like it will keep scaling with the total number of users. On the other hand, a unique index on user emails, even with many millions of users, is an efficient, O(1) lookup.

The total size of the Collection has no impact on the performance of an Index. Ideally, what you want is to have index terms that cover a bounded number of results that don’t scale with the whole Collection.

aprilmintacpineda · November 10, 2022, 1:12am

Great! I didn’t know that Match performs O(1)!

Polecat7 · November 10, 2022, 5:39am

@ptpaterson @aprilmintacpineda thanks for your replies. Good reading.

@ptpaterson this is a pattern I am actively working on for a document locking scheme I have put together to enable users to manage permissions on individual documents being shared between different user groups, where funnily enough those groups are themselves inside organizations.

In answer to your query, I decided to do a full write up of how I plan to implement this, but altered it to make it suitable for what @aprilmintacpineda is doing (and for easier context to anyone that comes along later and reads this thread).

I call this pattern the “Tertiary Index Pattern” - and so organizationList would be a tertiary index.
I don’t know if there already exists another name to describe this, or if the term Tertiary Index butts heads with something else, but for now it will do.

Here’s the pseudo code:

Enforce Unique Entries

export default CreateIndex({
  name: 'organization_users_unique',
  source: Collection('organisationUsers'),
  
  terms: [
    {
      field: ['data' , 'organisation'],
      field: ['data' , 'user']
    }
  ],

  values: [
    { field: ['data' , 'organisation'] },
    { field: ['data' , 'user'] }
  ],

  // uniqueness works on the combination of terms/values
  unique: true,
  serialized: false
})

Note: that the ‘ref’ of the record is missing from terms and values above - this is because having the ref as one of the values would mean that the uniqueness check would become meaningless because every new primary key for a record is different, so you could have two records with the same user and organisation, and it would not be blocked because the ref would be different.
Note: Performance wise, if you are going to have thousands of users regularly added to an organisation, it is much better to have the two terms: user and organisation - this will speed up Fauna’s ability to check for duplicates when you are creating a new document in organizationUsers. You can also query this index directly to see if a user has already been added to an organisation.

Fetch all users of an organization

export default CreateIndex({
  name: 'organization_users_by_organisation',
  source: Collection('organisationUsers'),
  
  terms: [
    {
      field: ['data' , 'organisation']
    }
  ],

  values: [
    { field: ['data' , 'user'] },
    { field: ['data' , 'organisation'] },
    { field: ['ref'] }
  ],

  // uniqueness works on the combination of terms/values
  unique: false,
  serialized: false
})

Note: this index can be used to fetch all users of a given organisation. It orders the results by user so search results are easier to search. The ref is also included so that you can quickly access and alter the organizationUsers document if needed.
If you don’t want to have two indexes that cover this, you could alter this index by removing the ref value, and setting the unique flag to true. This would enable this index to both serve to fetch users by their organization, and also to enforce uniqueness - however it will not be as performant as the organization_users_unique index above.

Adding a user to an organisation (support function)

This is a support function - do not call this function externally. It should only be used by AddUserToOrganization (documented below) or RemoveUserFromOrganization (an example name, but not documented below).

// Add or remove organization entries from a user's document
export default CreateFunction({
  name: 'UpdateUserOrganisations',
  body: Query(
    Lambda(
      ['user_id_or_ref', 'add_organization_id_or_ref', 'remove_organization_id_or_ref'],
      Let(
        {
			// Validate and fetch user document
          user_ref : If(
            IsRef(Var('user_id_or_ref')),
            Var('user_id_or_ref'),
            Ref(Collection("users"), Var('user_id_or_ref'))
          ),
          user : If (
            Exists(Var('user_ref')),
            Get(Var('user_ref')),
            Abort('Valid User Not Passed In')
          ),

          // Validate organization(s), but don't bother fetching them
          add_organization_ref : If(
            IsNull(Var('add_organization_id_or_ref')),
            null,
            If (
              IsRef(Var('add_organization_id_or_ref')),
              Var('add_organization_id_or_ref'),
              Ref(Collection("organizations"), Var('add_organization_id_or_ref'))
            )
          ),
          remove_organization_ref : If(
            IsNull(Var('remove_organization_id_or_ref')),
            null,
            If (
              IsRef(Var('remove_organization_id_or_ref')),
              Var('remove_organization_id_or_ref'),
              Ref(Collection("organizations"), Var('remove_organization_id_or_ref'))
            )
          ),
          organization_valid  : If (
            Or(
              And(Not(IsNull(add_organization_ref)), Exists(Var('organization_ref')))
              And(Not(IsNull(remove_organization_ref)), Exists(Var('remove_organization_ref')))
            ),
            true,
            Abort('Valid Organization Not Passed In')
          ),



          // Fetch and update the organizationList
          organizationList : Select(['data','organizationList'], Var('user'), []),

			// Adding items
          organizationList_update1 : if (
            Or(IsNull(Var('add_organization_ref')), IsEmpty(Var('add_organization_ref'))),
            Var('organizationList'),
            Union(Var('organizationList'), Var('add_organization_ref'))
          ),

			// Removing items
          organizationList_update2 : if (
            Or(IsNull(Var('remove_organization_ref')), IsEmpty(Var('remove_organization_ref'))),
            Var('organizationList_update1'),
            Difference(Var('organizationList_update1'), Var('remove_organization_ref'))
          ),
        },
        
        Update(
          Var('user_ref'),
          {
            data: {
              organizationList : Var('permission_group_update2')
            }
          }
        )
      )
    )
	),
  role: 'server'
})

Adding a user to an organisation (API)

Function used as API to add a user to an organization.

export default CreateFunction({
  name: 'AddUserToOrganization',
  body: Query(
    Lambda(
      ['organization', 'user'],
      Let(
        {
          isNewMapping : Not(Exists(Match(
            Index('organization_users_unique'),
            Var('organization'),
            Var('user')
          ))),
          result : Do (
            Var('isNewMapping'),
            Create(
              Ref(Collection("organizationUsers"), {
                data: {
                  user : Var('user'),
                  organization : Var('organization')
                }
              }
            ),
            Call('UpdateUserOrganisations', ['user', 'organization', null])
          )
        },
        Var('result')
      )
    )
  ),

  role: 'server'
})

To add a user to an organisation, you would then call it like this

Call('AddUserToOrganization', [
  Ref(Collection('organizations'), 123),
  Ref(Collection('users'), 456),
])

The code should be mostly self explanatory. AddUserToOrganization uses an index to see if a mapping already exists, if it does it will return ‘false’, else it will add the mapping to organizationUsers, and then use the UpdateUserOrganisations function to update the ‘tertiary index’ organizationList on the user’s document. I call this a tertiary index, but I am not sure if there is a proper name for using an array like this inside a document as an index.
UpdateUserOrganisations checks that the user record exists, that the organization being added or removed also exists, and then adds or removes the organization entry. Note that technically the Union and Difference functions support arrays, but the code that validates the add_organization or remove_organization values does not support arrays (no doubt there is a way to do this, but that would just make things more complicated than they need to be - and a use case that is probably not that likely?)

Again, this code has not been tested, but the principle should work - and it is what I will be working on shortly.

Status Field

One other point I would note is that I typically never delete documents (or records from a DB). Instead I put a status field on every one, and when a document is archived or ‘deleted’ I update the status value rather than removing it from the collection/table.
This means that I also use status values as part of my indexes to quickly filter items that are valid, verses data that is old.

The reason I don’t delete data is because it is easier to undo (when a user changes their mind about something) and because deleting data is not instant - so even if you delete something it still sits there for a while (and in the case of MySQL will be there forever until you run an optimisation on the DB or table affected - and will slow things down in the mean time as it has to keep jumping over deleted items). Fauna cleans up by itself (not having to manage that is pretty cool) and I presume is not affected in the same way as SQL DBs are by deleted records/documents hanging around, though there is still a delay in actually removing the data in Fauna.

From experience, I think it is better not to delete data. Obviously, if you are creating massive documents that use lots of space, it may make sense to just delete them - though I would probably choose to compact them and then archive them instead (keep them in the same collection, with status archive).

Status field values I use in code are:

STATUS_DISABLED
STATUS_ENABLED
STATUS_ARCHIVED
STATUS_DELETED

These are constants, with numeric values.

So in the above example code a valid record has status STATUS_ENABLED, and a deleted entry would be STATUS_DELETED. Now this obviously requires altering the code above to take the status value into account and alter it from STATUS_DELETED to STATUS_ENABLED in the case of a user being re-added to an organisation. You get the idea.

Good luck!

Edited for clarity.

ptpaterson · February 8, 2023, 5:40am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
SQL translation into FQL Query Help	4	416	October 21, 2020
Multilevel index Help fql , indexes	5	319	August 3, 2021
Help with Collection references & data modeling Help fql	5	1820	May 4, 2021
Querying and filtering best practice Help fql , best-practices	5	1240	September 13, 2021
FQLX: How do I Query for entries with their ts within a range? Help fql-v10-beta	2	295	July 26, 2023