索引备份实践

一、备份目标

1)备份原因

迁移备份线上ES数据到备份ES,后期计算TPS、单量等数据。

2)优化措施

目标:减小存储压力和提高迁移速度。

1、备份索引不设置副本

备份数据只用来统计分析,不需要高可用。

2、只迁移spanId为end的日志结点

备份内容只关注session数据(spanID),不关心具体处理内容。

3、迁移部分字段,过滤extend_1,extend_2,extend_3等字段,source大小变为1/10

用于统计分析,去除无关大字段。

二、备份过程

从源ES备份到目标ES,备份过程只在目标es操作。

1)建立和源ES相同的索引

可通过模板或者Mapping创建。

1
2
3
4
5
6
7
8
9
$ POST _template/xxx_tpl
{
"settings" : {
"index" : {
"number_of_replicas" : "0" # 副本修改为0,减少备份容量
}
},
...
}

2)配置reindex.remote.whitelist

reindex.remote.whitelist为源ES地址。

3)登陆目标ES,执行脚本

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-reindex.html

可执行多个索引迁移任务,多个reindex任务会并行执行,例如以10天为一个区间进行迁移;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# wait_for_completion把任务设为异步任务,避免中断
$ POST _reindex?wait_for_completion=false
{
"source": {
"remote": {
"host": "http://source ES:9200", # 源ES
"username": "xxx",
"password": "xxx"
},
"index": "xxx_2021.05.30",
"_source": ["appCode", "traceId", ... ], # 非全量迁移字段,减小存储压力和迁移速度
"size": 10000, # size字段取值参考下一节"reindex-size取值优化"
"query": { # 增加查询条件,非全量迁移document
"match": {
"spanId": "end"
}
}
},
"dest": {
"index": "xxx_2021.05.30",
"op_type": "create"
}
}

4)查看跟踪任务

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/tasks.html

1
2
3
4
5
6
# 查看任务
$ GET _tasks/[taskID]
$ GET _tasks?detailed=true&actions=*reindex

# 取消任务
$ POST _tasks/[taskID]/_cancel

三、reindex-size取值优化

size设置要合理,参考两个原则:

1)size <= 10000

2)size*docNum < bufferSize

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
{
"completed" : true,
"task" : {
"node" : "mxMIBxgNS3ek5URhsxmQkw",
"id" : 2964624,
"type" : "transport",
"action" : "indices:data/write/reindex",
"status" : {
"total" : 0,
"updated" : 0,
"created" : 0,
"deleted" : 0,
"batches" : 0,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : """
reindex from [host=es-nlb-xxx.xxx.com port=9200 query={
"match" : {
"spanId" : "end"
}
} username=xxx password=<<>>][xxx_2021.05.30] to [xxx_2021.05.30]
""",
"start_time_in_millis" : 1624958356693,
"running_time_in_nanos" : 778599920,
"cancellable" : true,
"headers" : { }
},
"error" : {
"type" : "illegal_argument_exception",
"reason" : "Remote responded with a chunk that was too large. Use a smaller batch size.",
"caused_by" : {
"type" : "content_too_long_exception",
"reason" : "entity content is too long [118324742] for the configured buffer limit [104857600]"
}
}
}

四、Reindex失败定位

1)定位方式

1、任务索引查询:.task/task/[taskID]

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-reindex.html

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.

2、任务API查询:GET _tasks?detailed=true&actions=*reindex

3、根据ES运行日志定位

2)定位案例

0、前置操作

最近迁移索引时,迁移到一半reindex任务就中断了,且前两种方式没有记录。遂观察ES日志。

1、查看目标ES日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
2021-11-12 20:18:23  [2021-11-12T20:18:16,964][INFO ][c.j.e.p.a.a.AuthActionFilter] [coordinating-1] action:'indices:data/write/index' require Authorization
2021-11-12 20:18:23 at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.tasks.TaskManager.storeResult(TaskManager.java:203) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:76) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 [2021-11-12T20:18:16,965][WARN ][o.e.t.LoggingTaskListener] [coordinating-1] 376642151 failed with exception
2021-11-12 20:18:23 at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:395) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.index.reindex.remote.RemoteScrollableHitSource.lambda$cleanup$2(RemoteScrollableHitSource.java:149) ~[?:?]
2021-11-12 20:18:23 org.elasticsearch.ElasticsearchSecurityException: Authorization required
2021-11-12 20:18:23 at org.elasticsearch.tasks.TaskResultsService.storeResult(TaskResultsService.java:134) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:87) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.tasks.TaskResultsService.doStoreResult(TaskResultsService.java:159) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at com.jdcloud.es.plugin.authentication.action.AuthActionFilter.apply(AuthActionFilter.java:159) ~[?:?]
2021-11-12 20:18:23 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:71) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:71) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
2021-11-12 20:18:23 at org.elasticsearch.action.support.TransportAction$TaskResultStoringActionListener.onFailure(TransportAction.java:205) ~[elasticsearch-6.5.4.jar:6.5.4]
2021-11-12 20:18:23 org.elasticsearch.ElasticsearchSecurityException: Authorization required
2021-11-12 20:18:23 [2021-11-12T20:18:16,964][WARN ][o.e.t.TaskManager ] [coordinating-1] couldn't store error SocketTimeoutException[null]
2021-11-12 20:22:51 [2021-11-12T20:22:44,620][INFO ][o.e.c.m.MetaDataIndexTemplateService] [master-0] adding template [kibana_index_template:.kibana] for index patterns [.kibana]

从异常日志中看到reindex的任务出现了couldn't store error SocketTimeoutException

2、查看源ES日志

1
2021-11-12 20:18:19  [2021-11-12T20:18:16,955][WARN ][o.e.t.TransportService ] [master-1] Received response for a request that has timed out, sent [36620ms] ago, timed out [6603ms] ago, action [internal:discovery/zen/fd/ping], node [{node-6}{luQIob0aT-63YllGaZ7ecA}{3gE6HKHAS72YtgubIzZEvQ}{11.77.76.176}{11.77.76.176:9300}{zone_id=az1}], id [208747742]

3、猜测原因

由源ES和目标ES的异常信息可推测出应该是请求源ES超时导致任务中断。

4、修改验证

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-reindex.html

It is also possible to set the socket read timeout on the remote connection with the socket_timeout field and the connection timeout with the connect_timeout field. Both default to 30 seconds.

根据文档描述增加超时参数,修改后迁移语句如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
$ POST _reindex?wait_for_completion=false
{
"source": {
"remote": {
"host": "http://source ES:9200",
"socket_timeout": "2m", # 增加超时参数,2min
"username": "xxx",
"password": "xxx"
},
"index": "xxx_2021.05.30",
"_source": ["appCode", "traceId", ... ],
"size": 10000,
"query": {
"match": {
"spanId": "end"
}
}
},
"dest": {
"index": "xxx_2021.05.30",
"op_type": "create"
}
}

修改后4800W数据耗时3小时迁移成功。

# 参考

  1. https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-reindex.html
  2. https://www.elastic.co/guide/en/elasticsearch/reference/6.8/tasks.html
  3. https://www.elastic.co/guide/en/elasticsearch/reference/6.0/reindex-upgrade-remote.html
  4. https://discuss.elastic.co/t/not-whitelisted-in-reindex-remote-whitelist/111070
  5. 提高reindex效率

评论