A few weeks back, our Elasticsearch cluster stopped executing any watchers. Initial analysis suggested a problem with the AWS SMTP service, which we use to send mail alerts to our LDAP accounts. After going through more logs and spending some time with the sent-mail statistics on AWS (thanks to AWS for providing an intuitive UI that gives insight into rejected emails), we were sure that sending email was not the problem and that something was wrong on the current master. The log line below made it clear that the issue was with the .watches index.
[2018-05-10T07:05:16,969][WARN ][o.e.g.DanglingIndicesState] [es-master-1] [[.watches/23nm9NSrSkeZaK4Dtyughg]] cannot be imported as a dangling index, as index with same name already exists in cluster metadata
Resolution
- Delete the local directory: The log line names the node that holds a stale copy of the index, along with the directory name. In our case it was the node es-master-1, with the directory 23nm9NSrSkeZaK4Dtyughg under the master's data folder.
- Restart the Watcher service: Once the stale index directory is deleted, restart the Watcher service:
POST _xpack/watcher/_restart
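
If you have to dig through a lot of master logs, the first step above (reading the node name and directory name out of the warning) can be scripted. The following is a minimal sketch, not something we ran during the incident. The function names are made up for illustration, the data path /var/lib/elasticsearch is a hypothetical example, and the nodes/0/indices/<uuid> layout is an assumption based on the default on-disk layout, so confirm the directory on the node before deleting anything.

```python
import re
from pathlib import PurePosixPath

# Matches the DanglingIndicesState warning quoted in this post and captures
# the node name, the index name and the index UUID (the directory name).
DANGLING_RE = re.compile(
    r"\[o\.e\.g\.DanglingIndicesState\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[\[(?P<index>[^/\]]+)/(?P<uuid>[^\]]+)\]\]\s*"
    r"cannot be imported as a dangling index"
)


def parse_dangling_warning(line):
    """Return a dict with node, index and uuid, or None if the line does not match."""
    match = DANGLING_RE.search(line)
    return match.groupdict() if match else None


def stale_index_dir(data_path, uuid, node_ordinal=0):
    """Build the candidate directory of the stale index copy.

    Assumption: the node uses the default layout
    <path.data>/nodes/<ordinal>/indices/<uuid>.
    """
    return PurePosixPath(data_path) / "nodes" / str(node_ordinal) / "indices" / uuid


line = (
    "[2018-05-10T07:05:16,969][WARN ][o.e.g.DanglingIndicesState] [es-master-1] "
    "[[.watches/23nm9NSrSkeZaK4Dtyughg]] cannot be imported as a dangling index, "
    "as index with same name already exists in cluster metadata"
)

info = parse_dangling_warning(line)
print(info["node"], info["index"], info["uuid"])

# Hypothetical data path; replace it with the path.data value of the node.
print(stale_index_dir("/var/lib/elasticsearch", info["uuid"]))
```

The script only prints the candidate path. Deleting the directory is deliberately left as a manual step, so you can check on the named node that it really is the stale copy.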