Unexpected (buggy?) Unique Index behavior

tl;dr:

  • It seems like unique indexes that explicitly include identity refs (refs to the given doc) in the values (e.g. values: [{field: ["ref"]}] and also bindings that include identity refs) don’t work. Perhaps indexes include that identity doc in the uniqueness calculation, making the uniqueness constraint useless? It would be great if those refs could be excluded from the uniqueness constraints. If not, documenting this would be useful (maybe I missed it).
  • It seems like null values are not indexed. That makes sense, and can be useful to have indexes only pertinent to a filtered subset of docs (especially if unique worked in that context, per the above)! I didn’t see that in the docs. If it’s actually not there, I’d recommend adding a note to that effect.
  • Open source might be helpful for community debugging.

Details
Hi. I have a few collections that should be unique across two or more fields. With some small surprises, it seems to work as expected when I don’t set values that are the indexed doc.

I’ve spent some time identifying decent examples, simplifying from my actual needs. I could take it further to see if I can find even simpler test cases, but I’m hoping this is enough to communicate reasonably quickly.

In all cases, I’m verifying that the index has active: true before testing (and in fact my test data is small enough that the indexes are active as soon as I create them).

In most cases, I’ve performed these experiments creating indexes both in the shell (CreateIndex) and in the Console’s “Indexes” UI. The results are the same, except when I create bindings, which I can’t do in the UI.

First, here’s an index that does work as expected.

{
  name: "unique_test_Sync",
  unique: true,
  serialized: true,
  source: "Sync",
  terms: [
    {field: ["data", "start"]},
    {field: ["data", "name"]},
    {field: ["data", "conversation"]}
  ]
}

If I make a Sync doc that has no start date (null in the index), a “test” name, and a conversation (a ref to a doc in another collection), then I get expected behavior: a search in the index returns the doc I made, and the index rejects attempts to create another doc with the same three key values.

Now we go into unexpected or buggy behavior. First, let’s delete the index and then make a new one that explicitly specifies a ref as a value.

{
  name: "unique_test2_Sync",
  unique: true,
  serialized: true,
  source: "Sync",
  terms: [
    {field: ["data", "start"]},
    {field: ["data", "name"]},
    {field: ["data", "conversation"]}
  ],
  values: [
    {field: ["ref"]}
  ]
}

Obviously this would be more useful if I had more values there, or a filtering binding, but this is the simplest story that demonstrates the behavior I want to highlight.

Good news: searches still work as expected. Bad news: the unique limitation no longer works. I can create a new document with the same three key values, without the database stopping me.
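To make the difference between the two definitions concrete, here is a minimal in-memory sketch in plain JavaScript. It is a toy model, not Fauna's implementation: `makeToyUniqueIndex`, `getPath`, and the sample refs are hypothetical names, and the only assumption is that an explicitly declared value becomes part of the uniqueness key while the default ref does not.

```javascript
// Toy model only: assumes the uniqueness key is built from the terms plus any
// explicitly declared values. Nothing here calls Fauna.
const getPath = (doc, path) =>
  path.reduce((v, k) => (v == null ? null : v[k]), doc) ?? null;

function makeToyUniqueIndex({ terms, values = [] }) {
  const seenKeys = new Set();
  return {
    // Returns true if the doc is accepted, false if it violates uniqueness.
    accepts(doc) {
      const key = JSON.stringify([
        terms.map((p) => getPath(doc, p)),
        values.map((p) => getPath(doc, p)),
      ]);
      if (seenKeys.has(key)) return false;
      seenKeys.add(key);
      return true;
    },
  };
}

const TERMS = [["data", "start"], ["data", "name"], ["data", "conversation"]];

// Two docs with identical terms (start is missing, so it reads as null).
const first = { ref: "Sync/1", data: { name: "test", conversation: "Conversation/9" } };
const second = { ref: "Sync/2", data: { name: "test", conversation: "Conversation/9" } };

// Like unique_test_Sync: no explicit values, so the key is the terms alone.
const termsOnly = makeToyUniqueIndex({ terms: TERMS });
const termsOnlyResults = [termsOnly.accepts(first), termsOnly.accepts(second)];

// Like unique_test2_Sync: the doc's own ref is part of the key.
const withRef = makeToyUniqueIndex({ terms: TERMS, values: [["ref"]] });
const withRefResults = [withRef.accepts(first), withRef.accepts(second)];

console.log(termsOnlyResults); // [ true, false ]
console.log(withRefResults);   // [ true, true ]
```

In this model the second document is rejected only when the ref is left out of the key, which matches what I observed with the real indexes.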

One hypothesis I had was that you hadn’t built in logic to exclude identity doc refs from the unique logic that looks across terms and values. It’s understandable if that’s not there yet, even if disappointing. (BTW, open source would help folks like me debug issues like this!)

For one other smaller surprise, let’s look at another variant. In this one, we’ll remove the ref and move the start date from a term to a value.

{
  name: "unique_test3_Sync",
  unique: true,
  serialized: true,
  source: "Sync",
  terms: [
    {
      field: ["data", "name"]
    },
    {
      field: ["data", "conversation"]
    }
  ],
  values: [
    {
      field: ["data", "start"]
    }
  ]
}

Remember–the start date is empty (effectively a null, per index docs). In that case, searches (now only of the name and the reference) don’t get any results at all. I’m assuming that this is because null values are not indexed, which makes sense. (I didn’t see that documented, but maybe I missed it.)
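The empty search result can be illustrated with another small JavaScript sketch. Again this is a toy model under a stated assumption, not Fauna's code: it assumes that no entry is stored when every declared value evaluates to null, which is my reading of the behavior above. `buildToyEntries`, `matchToyEntries`, and `readPath` are hypothetical helpers.

```javascript
// Toy model only: assumes a document gets no index entry when all of the
// index's declared values are null.
const readPath = (doc, path) =>
  path.reduce((v, k) => (v == null ? null : v[k]), doc) ?? null;

function buildToyEntries(docs, { terms, values }) {
  const entries = [];
  for (const doc of docs) {
    const vals = values.map((p) => readPath(doc, p));
    if (vals.every((v) => v === null)) continue; // nothing to store
    entries.push({ terms: terms.map((p) => readPath(doc, p)), values: vals });
  }
  return entries;
}

// Looks up entries by exact term match and returns their stored values.
function matchToyEntries(entries, termValues) {
  const wanted = JSON.stringify(termValues);
  return entries
    .filter((e) => JSON.stringify(e.terms) === wanted)
    .map((e) => e.values);
}

const sampleDocs = [
  { ref: "Sync/1", data: { name: "test", conversation: "Conversation/9" } },
  { ref: "Sync/2", data: { name: "other", conversation: "Conversation/9", start: "2020-06-01" } },
];

// Shaped like unique_test3_Sync: name and conversation as terms, start as value.
const toyEntries = buildToyEntries(sampleDocs, {
  terms: [["data", "name"], ["data", "conversation"]],
  values: [["data", "start"]],
});

const noStartMatch = matchToyEntries(toyEntries, ["test", "Conversation/9"]);
const withStartMatch = matchToyEntries(toyEntries, ["other", "Conversation/9"]);

console.log(noStartMatch);   // []
console.log(withStartMatch); // [ [ '2020-06-01' ] ]
```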

I did a number of other side experiments, but they basically confirm that everything is working as expected, with the above two caveats on explicit values and null values.

Thanks for reading!

Hi,

This is expected behaviour. Uniqueness is enforced on the combination of terms and values. https://docs.fauna.com/fauna/current/api/fql/indexes

unique (Boolean, optional): If true, maintains a unique constraint on combined terms and values. The default is false.

So the index below will not enforce uniqueness on the terms alone: the ref value is different for every document, so the combination of terms and values is always unique.

{
  name: "unique_test2_Sync",
  unique: true,
  serialized: true,
  source: "Sync",
  terms: [
    {field: ["data", "start"]},
    {field: ["data", "name"]},
    {field: ["data", "conversation"]}
  ],
  values: [
    {field: ["ref"]}
  ]
}

If no values are defined in an index, then the ref is returned by default. The index definition below returns the same value as the one above, and it also enforces uniqueness on the terms.

{
  name: "unique_test2_Sync",
  unique: true,
  serialized: true,
  source: "Sync",
  terms: [
    {field: ["data", "start"]},
    {field: ["data", "name"]},
    {field: ["data", "conversation"]}
  ]
}

For your third scenario, I think we did not document this because we considered it obvious that nothing is returned when the value is null (internally, no index entry is created). I will create a docs ticket to call it out explicitly.

Open Sourcing is something being discussed.

Thank you for the reply. Yes, the behavior I described for identity refs is logically consistent, but it is not useful, and thus surprising and unexpected to me as a user. Changing it would allow me to use the index for more purposes than only enforcing uniqueness. So I would be happy to see this as a feature improvement for the future.

Good to hear about the potential open sourcing!

Given that using both terms and values for uniqueness is intentional and beneficial, something would be needed to exclude the values from uniqueness.

@garyposter Do you have a suggestion for how to distinguish an index that uses all terms and values for uniqueness and those that don’t? A valuesAreNotUnique flag? ref is ignored 100% of the time? Something else? It just wasn’t clear to me while I read.

Point is, I wouldn’t expect the existing API to change just to support something new. What if you want to sort on a value that is non-unique? What if you DO want the sorted value to be unique?! What if you want some values to be unique and not others? That’s a lot of edge cases that need to be considered.

Hey, thank you for exploring this with me.

What I expected, and what I would suggest, is that refs for the currently indexed doc (the “identity ref”, as I called it) would be automatically excluded from uniqueness constraints. If the values for a given doc included the ref for the same doc, that ref would not be included in the uniqueness test. All other terms and values (including other refs) would be included, per the current rules.

Why?

  • The identity ref will always make a uniqueness constraint a no-op. If you include an identity ref, you might as well not make the index unique. Every indexed doc in the collection will be unique from an identity perspective, by definition.
  • From my initial perspective, excluding the identity ref from uniqueness was what the default “values” behavior (indexing the ref as the values) sets as precedent. That’s what happens when you don’t set values. It’s surprising to me that spelling “my values should be a ref to the indexed doc” as a default gets you one uniqueness behavior, while spelling it as field: ["ref"] or a binding gets you another. I expected the default behavior to be precisely equivalent to the other spellings, and that is my suggestion.
  • It would be useful to have this behavior. It would mean, for instance, that I could make a uniqueness constraint/index that only applies to a subset of docs, by using a binding in values that only includes the identity doc if it should participate in the uniqueness constraint (by whatever logic). I could use this now. (I can accomplish it in other ways, but they waste storage and make the index less useful for other needs.) It would also mean that a unique index could be sorted (assuming that the sorting value is also a desired uniqueness element), and thus used more flexibly than merely as a uniqueness constraint. I could use this now also. (I can accomplish this by creating an additional index, of course.)

I hope that explanation is clearer and more compelling than my first attempt!

Thanks again,
Gary

In my last bullet point, I tried to give a couple of use cases. Here’s an example of the first one (a binding in values means that uniqueness is only applied to certain docs).

CreateIndex({
  serialized: true,
  name: "unique_upcoming_Sync",
  source: {
    collection: Collection("Sync"),
    fields: {
      onlyUpcoming: Query(
        Lambda(
          ["doc"],
          If(
            And(
              IsNull(Select(["data", "stop"], Var("doc"), null)),
              IsNull(Select(["data", "start"], Var("doc"), null))
            ),
            Select(["ref"], Var("doc")),
            null
          )
        )
      )
    }
  },
  terms: [
    {field: ["data", "name"]},
    {field: ["data", "conversation"]}
  ],
  values: [
    {binding: "onlyUpcoming"}
  ],
  unique: true
})

In other words, “if this doc doesn’t have a start or stop time, please make sure that the combo of conversation and name is unique among other similar docs. Otherwise, don’t worry about it.” To be clear, this doesn’t work now, but it would be cool if it did.
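To spell out the semantics I am asking for, here is a toy JavaScript model of how the index above would behave if the identity ref were excluded from uniqueness. This is hypothetical behavior, not what Fauna does today; `makeProposedToyIndex` and the sample docs are made up for illustration, and the model also assumes that a doc whose binding evaluates to null gets no index entry.

```javascript
// Hypothetical semantics proposed in this thread, NOT current Fauna behavior.
// Mirrors the onlyUpcoming binding: the doc's own ref if it has neither a
// start nor a stop time, otherwise null.
const onlyUpcomingBinding = (doc) =>
  doc.data.start == null && doc.data.stop == null ? doc.ref : null;

function makeProposedToyIndex() {
  const uniqueKeys = new Set();
  return {
    tryInsert(doc) {
      const value = onlyUpcomingBinding(doc);
      // Assumption: a null value means no index entry, so no constraint applies.
      if (value === null) return "not indexed";
      // Proposed rule: the doc's own ref is dropped from the uniqueness key.
      const keyValues = [value].filter((v) => v !== doc.ref);
      const key = JSON.stringify([[doc.data.name, doc.data.conversation], keyValues]);
      if (uniqueKeys.has(key)) return "rejected";
      uniqueKeys.add(key);
      return "indexed";
    },
  };
}

const proposed = makeProposedToyIndex();
const base = { name: "test", conversation: "Conversation/9" };
const proposedResults = [
  proposed.tryInsert({ ref: "Sync/1", data: { ...base } }),
  proposed.tryInsert({ ref: "Sync/2", data: { ...base } }),
  proposed.tryInsert({ ref: "Sync/3", data: { ...base, start: "2020-06-01" } }),
  proposed.tryInsert({ ref: "Sync/4", data: { ...base, start: "2020-06-02" } }),
];

console.log(proposedResults);
// [ 'indexed', 'rejected', 'not indexed', 'not indexed' ]
```

Only the docs without a start or stop time compete for uniqueness; the ones that have already started are left alone.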

I hope the second use case (“a unique index could be sorted (assuming that the sorting value is also a desired uniqueness element)”) is already reasonably clear, but can provide an example if desired.

@ptpaterson did my second attempt to explain my problem statement and suggestion land more clearly than my first?

Yes, and it is more specific, which is where I was trying to lead you. Before it was more, “I have a problem, can you fix it?” Now it’s more “I have this problem, and if Fauna handled things in this other way, I and others would benefit greatly from it”. The latter is a better feature request.

My hope is that the more specific we can be with feature requests, the easier that will translate into useful and effective tickets on Fauna’s side.

Cheers! :nerd_face:

Hey Gary,

You asked for my insight on another ticket, so here I am :slight_smile:.
Jay is right that it is ‘expected’ behavior. The uniqueness is defined by the values + terms that were explicitly defined in the index definition.

However, your points are also valid. Having a reference in these values that counts for uniqueness is quite useless, and that is annoying since … let’s face it… you often need a reference. Of course, you can make a separate index for uniqueness only, but you are right that that is wasted space. I’ll deliver your feedback; it aligns with other feedback such as: “can we separate sorting from return values”. These are all valid points. Currently, FaunaDB gives you raw access to the index and the underlying structure. That’s great for many use cases. The alternative is that we try to be more intelligent, but that can, of course, also go wrong if we sometimes make a ‘not so intelligent’ decision.

I’m sure we will move more towards the things you are asking for, since it’s indeed confusing. I’m currently undecided whether removing refs from uniqueness is a good idea. It’s definitely more useful but feels inconsistent. We probably need to have separate values altogether for uniqueness. But as I said, if that results in two indexes behind the scenes, it’s better from a DevXP perspective, yet you still incur the same storage. I’ll deliver your feedback and let product/engineering think about it :slight_smile:

LOL, thanks so much @databrecht. MVP. :wink:

Yeah, raw index access is genuinely fun! It reminds me of ye olde ZODB, if anyone is familiar with that from 10-20 years ago.

Nope, I was still working my way through high-school 20 years ago :slight_smile:
The raw index access is a choice that comes from (as far as I know…)

  • Experience with ‘smart indexes’ not being smart at all and/or indexes not being used due to a query planner that changes the plan as your data grows (in FaunaDB, the query is the query plan, and the index… well, you get access to the actual data it contains, to use how you want).
  • Making sure in a pay-as-you-go system that there is no sudden hidden cost (e.g. by creating a second index behind your back or allowing you to do things the index is not capable of doing by just computing it anyway without the index which might result in a full scan)

Good motivations imo :slight_smile: although it might feel a bit low-level at times.
Nevertheless, there are improvements we can make to the DevXP! :upside_down_face: