Troubleshooting High CPU Usage

1. Background

Recently, on an EDI production application, the non-public-cloud machines newly added for the 618 promotion were frequently running at 99% CPU and repeatedly triggering CPU alarms, while the other machines in the same cluster stayed below 10% CPU usage.

2. Troubleshooting Steps

1) Confirm whether request load is balanced across the cluster

The request pie chart in ES shows that load is evenly balanced across the cluster, and the request count on the problem machine is not high.

• Conclusion of this step:

This rules out heavy traffic as the cause of the CPU spike.

2) On the problem machine: run top -Hp pid and jstack pid

With the process at 398% CPU, even the busiest threads of that process were each below 0.9% CPU, i.e. the load is spread across many threads rather than concentrated in a few hot ones.

# top output (when run inside docker, top shows the host machine's overall figures, so the numbers are slightly skewed):
top - 03:15:41 up 225 days, 11:56, 0 users, load average: 14.07, 8.72, 7.11
Tasks: 35 total, 2 running, 27 sleeping, 0 stopped, 6 zombie
%Cpu(s): 14.7 us, 1.6 sy, 0.1 ni, 83.5 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13134801+total, 17604608 free, 34303548 used, 79439864 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 72823864 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
491335 admin 20 0 23.156g 4.909g 30008 S 398.0 3.9 2103:56 java
384 admin 20 0 375008 6512 4908 S 1.0 0.0 5:10.69 deploy
368428 admin 20 0 113256 3276 2996 S 1.0 0.0 0:00.01 bash
1 root 20 0 11644 2532 2332 S 0.0 0.0 0:04.34 sh
11 root 20 0 83060 3092 2220 S 0.0 0.0 0:01.33 sshd
13 root 20 0 22796 2564 1892 S 0.0 0.0 0:43.65 crond
# Per-thread CPU usage:
%CPU PID TID TIME TTY STAT PSR
0.8 491335 495955 01:07:10 ? Sl 10
0.7 491335 495966 00:58:33 ? Sl 21
0.7 491335 495939 00:59:09 ? Sl 18
0.6 491335 495967 00:51:55 ? Sl 23
0.6 491335 495956 00:48:13 ? Sl 27
0.5 491335 495977 00:38:56 ? Sl 8
0.5 491335 495976 00:40:46 ? Sl 30
0.5 491335 491446 00:45:24 ? Sl 17
0.4 491335 495974 00:32:04 ? Sl 20
0.4 491335 495973 00:32:35 ? Sl 14
0.4 491335 495972 00:31:29 ? Sl 11
0.4 491335 495970 00:32:38 ? Sl 22
0.4 491335 495963 00:33:55 ? Sl 6
0.4 491335 495958 00:34:36 ? Sl 6
0.4 491335 495952 00:31:20 ? Sl 28
0.4 491335 495949 00:31:53 ? Sl 0
0.4 491335 495946 00:31:00 ? Sl 27
0.4 491335 495944 00:33:09 ? Sl 20
0.4 491335 495938 00:30:23 ? Sl 10
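
To map a busy TID from the listing above onto a jstack entry, convert it to hex and look for the matching nid. A minimal sketch of the conversion (the TID value is taken from the listing above):

public class TidToNid {
    public static void main(String[] args) {
        // top/ps report thread ids in decimal; jstack reports the same id as hex in the nid field.
        long tid = 495955L;                                   // hottest thread in the listing above
        System.out.println("nid=0x" + Long.toHexString(tid)); // prints nid=0x79153
    }
}

TID 495955 maps to nid=0x79153, which is the "JSF-BZ-22000-12-T-680" worker thread whose stack appears in step 6.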

One clearly abnormal thing in the jstack output is that there are 23 Parallel GC Threads, whereas the cloud machines have only 2 GC threads.

According to the referenced article, ParallelGCThreads = 8 + ((N - 8) * 5/8) when N > 8; working backwards from 23 GC threads gives N = 32 cores (so the JVM is clearly picking up the physical host's core count, even though the docker container only has 4C).
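
A minimal sketch of that sizing heuristic (assuming the commonly documented HotSpot default of ParallelGCThreads = N for N <= 8, otherwise 8 + (N - 8) * 5/8), showing that 32 cores indeed yields the 23 GC threads seen in the dump below:

public class GcThreadHeuristic {
    // Assumed HotSpot default: N GC threads for N <= 8 cores, else 8 + (N - 8) * 5/8.
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        System.out.println(defaultParallelGcThreads(32)); // 23 -> matches the 23 "Gang worker" threads below
        System.out.println(defaultParallelGcThreads(4));  // 4  -> the value later forced via -XX:ParallelGCThreads=4
    }
}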

"VM Thread" os_prio=0 tid=0x00007f0cf41cf800 nid=0x77f6a runnable 

"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf403a800 nid=0x77f4c runnable

"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf403c000 nid=0x77f4d runnable

"Gang worker#2 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf403e000 nid=0x77f4e runnable

"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf403f800 nid=0x77f4f runnable

"Gang worker#4 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4041800 nid=0x77f50 runnable

"Gang worker#5 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4043800 nid=0x77f51 runnable

"Gang worker#6 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4045000 nid=0x77f52 runnable

"Gang worker#7 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4047000 nid=0x77f53 runnable

"Gang worker#8 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4049000 nid=0x77f54 runnable

"Gang worker#9 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf404a800 nid=0x77f55 runnable

"Gang worker#10 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf404c800 nid=0x77f56 runnable

"Gang worker#11 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf404e800 nid=0x77f57 runnable

"Gang worker#12 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4050000 nid=0x77f58 runnable

"Gang worker#13 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4052000 nid=0x77f59 runnable

"Gang worker#14 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4053800 nid=0x77f5a runnable

"Gang worker#15 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4055800 nid=0x77f5b runnable

"Gang worker#16 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4057800 nid=0x77f5c runnable

"Gang worker#17 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4059000 nid=0x77f5d runnable

"Gang worker#18 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf405b000 nid=0x77f5e runnable

"Gang worker#19 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf405d000 nid=0x77f5f runnable

"Gang worker#20 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf405e800 nid=0x77f60 runnable

"Gang worker#21 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4060800 nid=0x77f61 runnable

"Gang worker#22 (Parallel GC Threads)" os_prio=0 tid=0x00007f0cf4062800 nid=0x77f62 runnable

"Concurrent Mark-Sweep GC Thread" os_prio=0 tid=0x00007f0cf417b000 nid=0x77f69 runnable

"Gang worker#0 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf416f000 nid=0x77f63 runnable

"Gang worker#1 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf4171000 nid=0x77f64 runnable

"Gang worker#2 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf4172800 nid=0x77f65 runnable

"Gang worker#3 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf4174800 nid=0x77f66 runnable

"Gang worker#4 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf4176800 nid=0x77f67 runnable

"Gang worker#5 (Parallel CMS Threads)" os_prio=0 tid=0x00007f0cf4178000 nid=0x77f68 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f0cf5545000 nid=0x77f82 waiting on condition

JNI global references: 702

• Conclusion of this step:

The excessive number of GC threads may be the cause; first, figure out why the GC thread count differs from the cloud machines.

3) Confirm that the JDK version difference causes the GC thread count difference

According to articles found via Google, only JDK 8u131 and later can detect the core count inside a docker container; earlier versions pick up the physical host's core count.

The Java version in use can be determined from the jstack header:

  • Non-cloud machine (the problem machine)
jstack output:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.51-b03 mixed mode):

Corresponding Java version:
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
  • Cloud machine
jstack output:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):

Corresponding Java version:
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b27)
Java HotSpot(TM) Client VM (build 25.60-b23, mixed mode)

• Conclusion of this step:

The different JDK versions make Runtime#availableProcessors return different core counts, which in turn leads to different ParallelGCThreads values.
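
A quick way to confirm this on either host is to print what the JVM itself resolves; Runtime#availableProcessors is the value the GC thread sizing starts from (a minimal sketch):

public class CpuCheck {
    public static void main(String[] args) {
        // On the older, non-container-aware build this reports the physical host's cores
        // (32, per the inference in step 2); on the newer build it reflects the container.
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}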

4) On the problem machine: run jstat and find frequent Young GC

The GC statistics show that young GC is extremely frequent.

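# jstat output (format matches jstat -gcutil: per-space utilization % plus GC counts and cumulative times in seconds)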
S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
99.64 99.61 61.62 80.11 83.44 80.94 3182 82.199 53 18.932 101.131
0.00 99.25 73.17 88.17 83.44 80.94 3186 82.345 53 18.932 101.278
99.87 0.00 58.43 95.48 83.44 80.94 3191 82.443 53 18.932 101.375
0.00 99.63 99.95 97.75 83.44 80.94 3192 82.464 54 18.932 101.396
99.90 0.00 99.99 10.72 83.44 80.94 3194 82.501 54 19.694 102.195

• Conclusion of this step:

Young GC is very frequent; as a first step, set ParallelGCThreads to the docker container's core count and observe.

5) On the problem machine: set -XX:ParallelGCThreads=4; the problem persists

• Conclusion of this step:

The root cause of the frequent GC must still be in the application itself: it must be allocating and releasing memory at a high rate during execution. So we went back to the jstack output for another look.

6) On the problem machine: inspect the stacks again and find logger#debug calls

Looking at the jstack output again, we found a large number of logger#debug calls in the stacks. The log level configured in the external log4j2.xml is WARN, so at this point we suspected a conflict between multiple slf4j bindings.

"JSF-BZ-22000-12-T-680" #904 daemon prio=5 os_prio=0 tid=0x00007f0b8409c800 nid=0x79153 runnable [0x00007f088f9fd000]
java.lang.Thread.State: RUNNABLE
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622)
at java.lang.StringBuilder.append(StringBuilder.java:202)
at com.xxx.xxx.toString(StandardListResponse.java:66)
at java.lang.String.valueOf(String.java:2982)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at java.util.AbstractMap.toString(AbstractMap.java:536)
at org.slf4j.helpers.MessageFormatter.safeObjectAppend(MessageFormatter.java:299)
at org.slf4j.helpers.MessageFormatter.deeplyAppendParameter(MessageFormatter.java:271)
at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:233)
at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:173)
at org.slf4j.helpers.MessageFormatter.format(MessageFormatter.java:151)
at org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:247)
at org.slf4j.helpers.MarkerIgnoringBase.debug(MarkerIgnoringBase.java:79)
at com.xxx.xxx.xxx(Xxx.java:35)

• Conclusion of this step:

The logger#debug calls are actually being executed, which points to a conflict between multiple slf4j bindings. These debug calls are expensive: several of them trigger contextMap#toString, and that is what makes young GC so frequent.
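
A hypothetical reconstruction of the call pattern behind that stack trace (class and variable names assumed, not from the original code), showing why the debug calls become expensive once DEBUG is effectively enabled:

import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContextLogging {
    private static final Logger logger = LoggerFactory.getLogger(ContextLogging.class);

    void process(Map<String, Object> contextMap) {
        // With a parameterized call, SLF4J renders the arguments only when DEBUG is enabled.
        // Here DEBUG *was* effectively enabled (the wrong binding ignored log4j2.xml's WARN),
        // so every call walks the whole map and builds a large String
        // (MessageFormatter -> AbstractMap.toString -> StandardListResponse.toString -> Arrays.copyOf),
        // producing exactly the kind of short-lived garbage that drives frequent young GC.
        logger.debug("request context: {}", contextMap);
    }
}

Fixing the binding (next step) restores the intended WARN level, so these calls go back to being cheap level checks.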

7) On the problem machine: verified to be an slf4j multiple-binding conflict

Checking the log4j log directories and the lib jars confirmed a log4j package conflict: the cloud and non-cloud machines are bound to different log4j implementations.

• Conclusion of this step:

When multiple slf4j binding jars are on the classpath, the cloud and non-cloud hosts behave differently and select different bindings, and that is what caused this CPU spike.
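
A minimal sketch (an assumed helper, not part of the original fix) for checking at runtime which SLF4J binding a host actually selected, which makes per-host differences like this one easy to spot:

import org.slf4j.ILoggerFactory;
import org.slf4j.LoggerFactory;

public class Slf4jBindingCheck {
    public static void main(String[] args) {
        ILoggerFactory factory = LoggerFactory.getILoggerFactory();
        // e.g. "org.slf4j.impl.Log4jLoggerFactory"           -> slf4j-log4j12 (log4j 1.x)
        //      "org.apache.logging.slf4j.Log4jLoggerFactory" -> log4j-slf4j-impl (log4j2)
        System.out.println("SLF4J binding factory: " + factory.getClass().getName());
        System.out.println("debug enabled: "
                + LoggerFactory.getLogger(Slf4jBindingCheck.class).isDebugEnabled());
    }
}

SLF4J also prints a "Class path contains multiple SLF4J bindings" warning at startup when more than one binding jar is present, which is worth looking for in the application logs.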

8) Resolve the conflict; problem solved

3. Conclusions

1) If CPU usage is persistently high, jstack can be used to analyze the cause;

2) If CPU usage spikes intermittently, a flame graph can be used to analyze the cause; a flame graph can be thought of as jstack with a time dimension added, so it has less sampling bias.

# References

  1. https://christopher-batey.medium.com/cpu-considerations-for-java-applications-running-in-docker-and-kubernetes-7925865235b7
  2. https://segmentfault.com/a/1190000009407521
