In this blog post I want to share my experience and the lessons learned while upgrading our Elasticsearch cluster from 5.4 to 5.6, with TBs of data, without any downtime.
To start, I would first like to introduce the playground and the rules we were required to follow while performing the cluster-wide upgrade.
- Indexing and searching should not be impacted; at the time of the upgrade we were handling an indexing rate of ~20k/sec and a search rate of ~2,500/sec
- Watchers/alerts should always be running, as they provide valuable, actionable insight into the TBs of data we host
- Security of the cluster should not be compromised
- Plugins, visualizations, and dashboards are all expected to function as they did before the upgrade
As it was a minor/rolling upgrade, we went one node at a time. Things were simple when moving from one data node to the next, as long as we kept the points below in check:
- Stop cluster-wide allocation of shards. When we stop one of the nodes to upgrade its ES version, the master will mark some primary and replica shards as missing, which would normally trigger re-allocation of those shards. We don't want that to happen, because the node is going to rejoin the cluster in a short while.
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}
- Stopping a node will put the cluster into a yellow state. Wait for the cluster to turn green again before jumping to the next node (you can poll this from the API, as shown after the snippet below). Each time you bring a node back after upgrading its ES version, re-enable allocation.
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
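A quick way to wait for the cluster to go green again without staring at Kibana is the cluster health API; a minimal sketch (the timeout value is just an illustration, pick one that suits your cluster):
GET _cluster/health?wait_for_status=green&timeout=60s
If the timeout expires before the cluster turns green, the response comes back with "timed_out": true and the current status, so you know to keep waiting before moving to the next node.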
- There is always a possibility that even after re-enabling allocation, ES will take a long time to allocate shards and turn the status green. If that is the case, look at the index.unassigned.node_left.delayed_timeout setting. You may need to adjust (typically lower) it before ES starts assigning the shards.
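As a rough sketch, this timeout can be changed on the fly across all indices with an index settings update (the "1m" value here is only an example, not a recommendation):
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "1m"
  }
}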
- One of the pain points of a rolling upgrade with ES is that, until the whole upgrade is over, you will have nodes running different versions of the X-Pack plugin, which makes it impossible to track progress from Kibana. Your best bet is to rely on the REST API endpoints (see the sketch after this point). For instance, assume a cluster with 2 client nodes on which Kibana is running and the remaining 5 nodes acting as data nodes and master nodes. If we upgrade datanode1, its X-Pack plugin version will differ from that of the client node, because the client node is not yet upgraded and neither is Kibana. This prevents datanode1 from sending monitoring data points to Kibana, making it impossible to visualize progress in the Kibana UI.
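When the monitoring UI goes dark, the cat APIs are enough to follow the upgrade; a minimal sketch of the two calls I have in mind (the column names are standard _cat parameters):
GET _cat/nodes?v&h=name,version,node.role,master
GET _cat/health?v
The first one shows which nodes are already on 5.6, and the second gives you the overall cluster status without Kibana.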
- A natural instinct is to upgrade Kibana first and then the rest of the nodes to avoid any monitoring problems, but that turns into a vicious circle: Kibana 5.6 will not start until all the nodes are at 5.6. Kibana 5.4 can work with some nodes at 5.4 and some at 5.6, but it won't be able to display monitoring for the nodes that are already at 5.6.
- Another pain point is Watcher. Watcher runs on the master node. Always, and I repeat always, remember to upgrade the master nodes last. If you upgrade the master nodes first while some of your data nodes are still running 5.4, your watchers will fail, and they will keep failing until you have upgraded Kibana.
- Since the watchers use the .monitoring-* indices, all of them will keep failing, even after you have upgraded all your client, data, and master nodes, until Kibana is upgraded as well.
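Once every node and Kibana are on 5.6, it is worth confirming that Watcher is actually executing again rather than just assuming it; a quick check via the X-Pack endpoint as documented for 5.x:
GET _xpack/watcher/stats
The response reports the Watcher state and execution counters, and you can cross-check recent runs in the .watcher-history-* indices.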