The First Cry of Atom

How to reindex Elasticsearch

A database schema inevitably evolves over time. We may need to add type information for new columns or rename an existing attribute; otherwise, the database cannot deliver the values users expect. Schema evolution is an unavoidable part of running a database system in a real business.

Elasticsearch is no exception: we sometimes have to update the definition of an index to meet business requirements or engineering demands. An index in Elasticsearch roughly corresponds to a table in a traditional RDBMS. I am not going into further detail about Elasticsearch itself here; Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine provides enough information to understand what happens under the hood of Elasticsearch.

In this article, I’m going to illustrate how to update an existing Elasticsearch index without downtime by using an alias and the reindex API. Assume the current index (myindex_v1) is aliased to myindex.

Create a new index

Creating a new index does not touch the existing one, so it is safe to do while clients keep using myindex. The new index is named myindex_v2.

$ curl -H 'Content-Type: application/json' \
 -XPUT http://<Elasticsearch Host>/myindex_v2 \
 -d @create_index.json

You may want to configure a specific analyzer when creating the new index. This example uses the Kuromoji tokenizer, provided by the analysis-kuromoji plugin, to handle queries in Japanese.

$ cat create_index.json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji": {
            "type": "kuromoji_tokenizer"
          }
        },
        "analyzer": {
          "analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji",
            "filter": [
              "cjk_width"
            ]
          }
        }
      }
    }
  }
}
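
Once the index exists, it can help to confirm that the analyzer tokenizes Japanese text as intended. The following is a minimal sketch using the _analyze API; the ES_HOST variable, the helper function names, and the sample text are assumptions for illustration and are not part of the original setup.

```shell
# Assumption: ES_HOST holds host:port of the cluster (stands in for <Elasticsearch Host>).
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the _analyze request body.
# $1 = analyzer name, $2 = sample text
# The text is not JSON-escaped, so it must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Ask the given index to tokenize the sample text with the given analyzer.
# $1 = index, $2 = analyzer name, $3 = sample text
analyze_text() {
  curl -sS -H 'Content-Type: application/json' \
    -XPOST "http://${ES_HOST}/$1/_analyze" \
    -d "$(analyze_body "$2" "$3")"
}

# Example (requires a reachable cluster):
#   analyze_text myindex_v2 analyzer '東京都に行く'
```

The response lists the tokens that were produced, so a whole sentence coming back as a single token would suggest that the custom analyzer was not applied.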

Move data to the new index

In Elasticsearch, data is migrated from one index to another with the reindex operation.

$ curl -H 'Content-Type: application/json' \
 -XPOST http://<Elasticsearch Host>/_reindex \
 -d @reindex.json

The migration target and source index are specified in the body of the request.

$ cat reindex.json
{
  "source": {
    "index": "myindex_v1"
  },
  "dest": {
    "index": "myindex_v2"
  }
}
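
Before switching clients over, it is worth checking that the copy is complete. The sketch below starts the reindex as a background task and compares document counts with the _count API. ES_HOST and the function names are hypothetical, and the sed-based parsing assumes the compact JSON responses that Elasticsearch returns by default.

```shell
# Assumption: ES_HOST holds host:port of the cluster (stands in for <Elasticsearch Host>).
ES_HOST="${ES_HOST:-localhost:9200}"

# Pull the task id out of a response such as {"task":"node:123"}.
extract_task_id() {
  sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Start the reindex without blocking; prints the task id.
# Uses the reindex.json file shown above.
start_reindex() {
  curl -sS -H 'Content-Type: application/json' \
    -XPOST "http://${ES_HOST}/_reindex?wait_for_completion=false" \
    -d @reindex.json | extract_task_id
}

# Show the current state of a task. $1 = task id
task_status() {
  curl -sS "http://${ES_HOST}/_tasks/$1"
}

# Pull the number out of a _count response such as {"count":42,...}.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the number of documents in an index. $1 = index
doc_count() {
  curl -sS "http://${ES_HOST}/$1/_count" | extract_count
}

# Succeed only when both indices report the same count. $1, $2 = indices
counts_match() {
  [ "$(doc_count "$1")" = "$(doc_count "$2")" ]
}

# Example (requires a reachable cluster):
#   task=$(start_reindex); task_status "$task"
#   counts_match myindex_v1 myindex_v2 && echo "counts match"
```

Counts only include documents that have already been refreshed, so refreshing myindex_v2 first (POST /myindex_v2/_refresh) helps avoid a false mismatch right after the reindex finishes.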

Change the alias

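If clients access the data by the name myindex, the new index is still not visible to them because myindex is aliased to myindex_v1. The alias has to be switched so that it refers to myindex_v2. Adding the alias to myindex_v2 alone would leave myindex pointing at both indices, so the request also removes it from myindex_v1. All actions in a single _aliases request are applied atomically, so clients never see a moment without the alias.

$ curl -H 'Content-Type: application/json' \
 -XPOST http://<Elasticsearch Host>/_aliases \
 -d @alias.json

$ cat alias.json
{
  "actions": [
    {
      "remove": {
        "index": "myindex_v1",
        "alias": "myindex"
      }
    },
    {
      "add": {
        "index": "myindex_v2",
        "alias": "myindex"
      }
    }
  ]
}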

Now requests to myindex are routed to myindex_v2. The benefit of using an alias is that we avoid downtime and can easily roll back the migration if something is wrong with the new index, because switching an alias completes quickly. Overall, making sure that all clients access the data via an alias rather than the index name is the recommended pattern for making this kind of operation possible.
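
To make the rollback concrete, here is a small sketch that moves an alias from one index to another in a single _aliases request. The function names and ES_HOST are assumptions for illustration; the request body follows the remove/add action format of the _aliases API.

```shell
# Assumption: ES_HOST holds host:port of the cluster (stands in for <Elasticsearch Host>).
ES_HOST="${ES_HOST:-localhost:9200}"

# Build an _aliases body that detaches the alias from one index
# and attaches it to another in the same request.
# $1 = alias, $2 = index to detach, $3 = index to attach
alias_swap_body() {
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}' \
    "$2" "$1" "$3" "$1"
}

# Send the swap to the cluster. Same arguments as alias_swap_body.
swap_alias() {
  curl -sS -H 'Content-Type: application/json' \
    -XPOST "http://${ES_HOST}/_aliases" \
    -d "$(alias_swap_body "$1" "$2" "$3")"
}

# List the indices an alias currently points to. $1 = alias
alias_targets() {
  curl -sS "http://${ES_HOST}/_cat/aliases/$1?h=index"
}

# Example (requires a reachable cluster):
#   swap_alias myindex myindex_v2 myindex_v1   # roll back to the old index
#   alias_targets myindex
```

A rollback like this only works while myindex_v1 still exists, so it seems prudent to keep the old index around until the new one has proven itself.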

Thanks!