Ever since upgrading our deployment to Federate 7.17.3-27.2, we have started to see the following response from time to time from the siren/_search endpoint.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "i",
        "reason" : "Unable to acquire a search lock on indices MY_INDEX_NAME",
        "suppressed" : [
          {
            "type" : "index_not_found_exception",
            "reason" : "no such index [MY_INDEX_NAME]",
            "index_uuid" : "XXXXXXXX",
            "index" : "MY_INDEX_NAME"
          }
        ]
      }
    ],
    "type" : "i",
    "reason" : "Unable to acquire a search lock on indices MY_INDEX_NAME",
    "suppressed" : [
      {
        "type" : "index_not_found_exception",
        "reason" : "no such index [MY_INDEX_NAME]",
        "index_uuid" : "XXXXXXXXXXXX",
        "index" : "MY_INDEX_NAME"
      }
    ]
  },
  "status" : 500
}
It does not occur every single time. Retrying the request can sometimes make the issue go away. I’m not sure why it thinks the index isn’t found because it’s definitely there and I have no issues querying it from the normal Elasticsearch /_search endpoints.
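For reference, this is roughly the shape of the two requests I’m comparing (placeholder index name and a trivial match_all body; the real queries are more involved):

# Plain Elasticsearch search -- this always succeeds
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/MY_INDEX_NAME/_search' -d '{"query": {"match_all": {}}}'

# Same index through the Federate endpoint -- this is the one that intermittently returns the 500 above
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/siren/MY_INDEX_NAME/_search' -d '{"query": {"match_all": {}}}'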
My current workaround is to clone the index, which seems to solve the problem, but within hours or days the same complaint starts appearing for a different index.
Is this a known issue with this version? Any ideas how to fix this?
Can you please confirm how you performed the Siren Federate upgrade? Was it a rolling upgrade, and what was the previous version of Siren Federate?
Also, please cross-check the disk space: generally, when disk usage reaches 90%-95%, Elasticsearch locks indices (marks them read-only) because its disk watermarks protect against the server running out of space.
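A few quick checks you can run (standard cat/cluster APIs, nothing Federate-specific; localhost:9200 is a placeholder for your cluster endpoint):

# Disk usage per data node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Configured disk watermarks (defaults: low 85%, high 90%, flood_stage 95%)
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep watermark

# Whether the index has picked up a read-only block from the flood-stage watermark
curl -s 'http://localhost:9200/MY_INDEX_NAME/_settings?flat_settings=true&pretty' | grep blocks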
- These indexes are all part of an alias managed by ILM
- There are 6+ nodes and each index has 1 primary and 1 replica shard
- All nodes have 1+ TB of free disk space
- I have not noticed any shard relocations beyond the usual ILM ones, but the specific indexes in question had been rolled over already
I’ll stress that normal Elasticsearch search requests have no issues on the same indexes. It’s only when I switch to the /siren/INDEX/_search endpoint that I see these errors, so it feels plugin-specific rather than a misconfiguration of the underlying index. I can’t think of any configuration change we have made beyond updating Elasticsearch and the Federate plugin.
Happy to provide any other info you think would be helpful.
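For what it’s worth, this is roughly how I checked the points above (index name is a placeholder):

# Shard layout for one of the affected indexes: node, state, primary/replica
curl -s 'http://localhost:9200/_cat/shards/MY_INDEX_NAME?v'

# Any recoveries/relocations currently in flight
curl -s 'http://localhost:9200/_cat/recovery?active_only=true&v'

# Where the index sits in its ILM lifecycle (phase, action, step)
curl -s 'http://localhost:9200/MY_INDEX_NAME/_ilm/explain?pretty'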
I wanted to add that my team just ran into this exact issue. We were upgrading from ES 6.8.2 to 7.17.3-27.2 on a 21 node cluster.
The issue only occurred when running siren queries on indices that had also been written to after the upgrade.
We tried restarting the Elasticsearch service on each node, and that seemed to resolve the problem, though we are eager to hear if there is any more information about this issue.
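For anyone else trying this, what we did was just the standard rolling restart, roughly the following per node (endpoint and service manager are placeholders for your setup):

# 1. Keep the cluster from reshuffling shards while the node is down
curl -s -X PUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_cluster/settings' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# 2. Restart Elasticsearch on the node
sudo systemctl restart elasticsearch

# 3. Re-enable allocation and wait for green before moving to the next node
curl -s -X PUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_cluster/settings' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s'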
@microsen Just tried restarting the problem cluster and you’re right, restarting seems to have resolved the problem. I’ll report back here if the issue crops back up, but restarting seems to be a valid workaround in the meantime.
I spoke too soon… It seemed to help initially, but the errors eventually showed back up. We’re seeing a pattern where it happens after ILM moves an index from a hot node to a warm node and force-merges it.
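For context, the policies involved are ordinary hot/warm ones. A rough sketch of the shape (policy name, ages, sizes, and the node attribute are illustrative, not our actual settings):

# "data": "warm" assumes custom node attributes; clusters on built-in data tiers
# rely on the implicit migrate action instead of an explicit allocate
curl -s -X PUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_ilm/policy/my-hot-warm-policy' -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "allocate": { "require": { "data": "warm" } }
        }
      }
    }
  }
}'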
I’ve been running more tests on my end and I believe it has to do with shard location. One of our clusters uses ILM but does not relocate shards to warm nodes and is not giving us these errors.
So I ran a test: I first confirmed that siren searches were responding without errors, then manually rolled over the index. As soon as the shards moved to their warm nodes, siren started throwing the index-not-found errors. I manually moved the shards back to their original nodes and the siren requests started succeeding again.
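Concretely, the test looked roughly like this (alias, index, and node names are placeholders for ours):

# 1. Baseline: siren search on the freshly written index succeeds
curl -s 'http://localhost:9200/siren/my-index-000001/_search'

# 2. Manually roll the ILM-managed alias over
curl -s -X POST 'http://localhost:9200/my-alias/_rollover?pretty'

# 3. Watch the old index's shards land on the warm nodes
curl -s 'http://localhost:9200/_cat/shards/my-index-000001?v'

# 4. At that point the same siren search fails with the lock error; manually moving
#    a shard back to its original node makes it succeed again
curl -s -X POST -H 'Content-Type: application/json' \
  'http://localhost:9200/_cluster/reroute' \
  -d '{"commands": [{"move": {"index": "my-index-000001", "shard": 0, "from_node": "warm-node-1", "to_node": "hot-node-1"}}]}'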
I guess the workaround would be to ensure shards don’t move, but that only works as a temporary solution because then a single node will start to be overloaded.
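If anyone needs to buy time, something like this keeps an index’s shards on the hot tier (shown with built-in data tiers; clusters using custom node attributes would set index.routing.allocation.require.* instead). Note that ILM’s warm-phase migrate/allocate step will override it unless the policy is adjusted, so it really is only a stopgap:

curl -s -X PUT -H 'Content-Type: application/json' \
  'http://localhost:9200/my-index-000001/_settings' \
  -d '{"index.routing.allocation.include._tier_preference": "data_hot"}'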
@Manu_Agarwal Could you please try replicating the above scenario? We are deployed in Elastic Cloud and are unable to roll back to prior versions of ES + plugin so we’re a bit stuck here.
Thank you for these useful details, which allowed us to identify a malfunction related to index rerouting. We are still investigating it and will provide a patch release as soon as possible.
We were able to replicate the scenario you described; the bug has been fully identified and is now fixed.
Thanks again for taking the time to report this issue.
The fix will be available in the next patch release coming shortly.
Looks like the link is working now. We deployed this and ran a few tests with relocating shards and all seems to be resolved! Thank you for your open communication about this issue and quick resolution.