How long does index creation take within the community?

Hi all,

I have a collection of just over 37,500 documents (each document is about 30,000-33,000 lines, roughly 400-500 kB as JSON) and need to create a large number of indexes (currently 30, though this will grow to well over 200) to support sorting documents by their relevance to different locations. The indexes are all created through the Python driver, in the form:

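    # Assumed context: `client` is an authenticated faunadb.client.FaunaClient,
    # `q` is faunadb.query, and `index_name` / `location_hash` are set per location.
    client.query(
        q.create_index({
            "name": index_name,
            "source": {
                "collection": q.collection("Recipe"),
                "fields": {
                    # Binding that pulls this location's sort score out of each document.
                    "sort_result": q.query(
                        q.lambda_(
                            "doc",
                            q.select(["data", "algorithm_result", location_hash],
                                     q.var("doc"))
                        )
                    )
                }
            },
            "terms": [
                {"field": ["ref"]}
            ],
            "values": [
                # Highest score first, then the document ref.
                {"binding": "sort_result", "reverse": True},
                {"field": ["ref"]}
            ]
        })
    )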

It’s been 48 hours and 5 of the 29 such indexes have completed building (the rest are in progress). Is it reasonable to expect these indexes to take this long? I’m wondering whether the rest of the community is seeing similar results. If anyone knows how I could improve this performance, I am all ears. It is a bit worrying, as I can’t see this collection growing to millions of documents unless this improves.
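For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It assumes the builds keep completing at the rate observed so far (5 finished in 48 hours), which is only a guess since the remaining builds are running concurrently; the function name is made up for illustration.

```python
def naive_build_hours(total_indexes: int, finished: int, elapsed_hours: float) -> float:
    """Linear extrapolation: hours to finish `total_indexes` at the observed rate."""
    if finished <= 0:
        raise ValueError("need at least one finished index to extrapolate")
    return total_indexes * elapsed_hours / finished


# Numbers from this post: 5 of 29 indexes done after 48 hours.
current = naive_build_hours(29, finished=5, elapsed_hours=48)
planned = naive_build_hours(200, finished=5, elapsed_hours=48)
print(f"29 indexes:  ~{current:.0f} h (~{current / 24:.1f} days)")
print(f"200 indexes: ~{planned:.0f} h (~{planned / 24:.0f} days)")
```

At that rate the current 29 indexes would need more than 11 days and 200 indexes about 80 days, which is why the numbers look off to me.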

Unfortunately I don’t think a suitable answer will be as simple as just emptying the collection, creating the new indexes and re-creating the documents that were deleted. In a production environment it is not tenable.

The event history slows down processing. All of the deleted documents still exist in the event history.

Setting the collection’s history_days to 0 noticeably speeds things up. But to notice this right away, you will want to start with a new collection that doesn’t have an event history.

Thanks @summer,

I have made the change to stop recording event history and will look to see how this improves.

This may be an oversimplification, but is there a way to calculate how much data is stored in the event history, if that is a proxy for the slowdown? In my case, I can’t imagine there would be that much, given I’ve had no more than 12,000 documents in the database at any one time. Many of these indexes are still building and we are now 72 hours in.

“I have made the change to stop recording event history”

I would suggest starting with a brand new collection with history_days set to 0, because I’m not sure the change takes effect immediately. There might be some garbage collection that needs to take place first (but please don’t quote me on that, as I am not one of the database engineers; this is just a guess).

“I’ve had no more than 12,000 documents in the database at any one time”

From my understanding, “at any one time” doesn’t matter as much as “ever, in total”. If you have ~10k documents, then delete all of them and add ~10k new documents, it’s effectively as if you have ~20k, and so on. But, again, that’s just a rough guess.
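To make that guess concrete, here is a toy tally in Python. It is only a model of the reasoning above, not how the database actually accounts for history; the class and field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class HistoryTally:
    live: int = 0          # documents that exist right now
    ever_created: int = 0  # every document ever written, deleted or not
    events: int = 0        # create and delete events retained in history

    def create(self, n: int) -> None:
        self.live += n
        self.ever_created += n
        self.events += n

    def delete(self, n: int) -> None:
        if n > self.live:
            raise ValueError("cannot delete more documents than exist")
        self.live -= n
        self.events += n  # a delete adds an event; the earlier create stays


tally = HistoryTally()
tally.create(10_000)   # first load
tally.delete(10_000)   # wipe it
tally.create(10_000)   # reload
print(tally.live, tally.ever_created, tally.events)  # 10000 20000 30000
```

So a collection that never holds more than ~10k documents at once can still have ~20k documents' worth of history behind it if history is retained.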

I’ll see if I can get a more definitive answer on what’s happening here.

Meanwhile, as a workaround to get unblocked, I would:

  1. Create a brand new collection with <100 documents and history_days set to 0
  2. Create all of the indexes
  3. Add the rest of the data
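The ordering of those three steps can be sketched in Python as below. The three callables are hypothetical placeholders for whatever driver calls you use; with the Python driver they would presumably wrap something like q.create_collection({"name": ..., "history_days": 0}), q.create_index(...) and q.create(...), but treat that mapping as an assumption.

```python
from typing import Callable, Iterable, Sequence


def rebuild_collection(
    documents: Sequence[dict],
    index_names: Iterable[str],
    create_collection: Callable[[int], None],  # receives history_days
    create_index: Callable[[str], None],
    insert_document: Callable[[dict], None],
    seed_size: int = 50,
) -> None:
    """Seed a fresh collection, build indexes while it is small, then load the rest."""
    if not 0 <= seed_size < 100:
        raise ValueError("seed_size should stay under 100 documents")

    # Step 1: brand new collection with history disabled, plus a small seed.
    create_collection(0)
    for doc in documents[:seed_size]:
        insert_document(doc)

    # Step 2: create every index while there is almost nothing to scan.
    for name in index_names:
        create_index(name)

    # Step 3: load the remaining documents now that the indexes exist.
    for doc in documents[seed_size:]:
        insert_document(doc)


# Dry run with callables that just log what would happen.
log = []
rebuild_collection(
    documents=[{"id": i} for i in range(120)],
    index_names=["sort_by_location_a", "sort_by_location_b"],
    create_collection=lambda days: log.append(("collection", days)),
    create_index=lambda name: log.append(("index", name)),
    insert_document=lambda doc: log.append(("doc", doc["id"])),
)
print(log[0], log[51], log[52], log[53])
```

The point is only the order: the indexes get created while the collection has a handful of documents, and everything else is written afterwards.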

@Darryl_Naidu, It looks like something unusual was happening with your index builds over the weekend. We are investigating and you should be receiving an email shortly.

Thanks @summer, I think the work done over the weekend resolved the issue. Thanks for your support. If anything I was doing led to this abnormality, let me know, as it would be good to make sure I don’t run into it again in future.