Hadoop NodeManager exits without a log

I found that my Hadoop NodeManager exits for no apparent reason and leaves no log behind.

1. server information

Red Hat Enterprise Linux Server release 7.2 (Maipo)

hadoop 2.8.0

namenode: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 cores, 64 GB RAM; 2 nodes for the namenode and 1 for the resourceManager/hive

datanode: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 24 cores, 128 GB RAM; 5 datanodes in total

nodemanager conf:

hadoop 233967 5.7 1.9 6424200 2600128 ? Sl 17:00 1:22 /opt/jdk/bin/java -Dproc_nodemanager -Xmx4096m -agentpath:/usr/lib/abrt-java-connector/libabrt-java-connector.so -agentlib:abrt-java-connector=output=/home/hadoop/hadoop/logs/abrt-agent.log -Xms4g -Xmx4g -Xmn3g -server -XX:SurvivorRatio=5 -XX:MetaspaceSize=256M -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -Xloggc:/home/hadoop/hadoop/logs/yarn_gc.log -XX:ErrorFile=/home/hadoop/hadoop/logs/error/yarn_hs-error.log -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dyarn.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.home.dir= -Dyarn.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dyarn.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.home.dir=/home/hadoop/hadoop -Dhadoop.home.dir=/home/hadoop/hadoop -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop/lib/native -classpath /home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/*:/home/hadoop/hadoop/share/hadoop/common/*:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop/share/hadoop/hdfs/*:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/share/hadoop/yarn/*:/home/hadoop/hadoop/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop/share/hadoop/mapreduce/*:/home/hadoop/hadoop/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop/share/hadoop/yarn/*:/home/hadoop/hadoop/share/hadoop/yarn/lib/*:/home/hadoop/hadoop/etc/hadoop/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager
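
Since the command line sets -Xloggc and -XX:ErrorFile explicitly, a quick first check after the process disappears is whether the JVM managed to write a fatal error log before dying. A minimal sketch, assuming the log paths from the ps output above:

# did the JVM leave a fatal error log, or anything unusual at the end of the GC log?
ls -l /home/hadoop/hadoop/logs/error/
tail -n 50 /home/hadoop/hadoop/logs/yarn_gc.log
# hs_err files may also land in the process working directory by default
ls /home/hadoop/hadoop/hs_err_pid*.log 2>/dev/null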

2. when it crashes

When I execute the following SQL statements in Hive, some of the datanodes may fail. I don't know whether this is the cause, but the crash times seem to coincide with these queries.

select max(ts) from beacon where year = 2017 and month = 6 and day >= 21

delete from beacon where ts = 1498629599829

insert into table beacon partition(year,month,day) select type,page,name,ts,year,month,day from beacon_txt where ts >= 1498629599829

3. symptoms when it crashes

1) the NodeManager process disappears

2) there are no exceptions in the yarn log; the fragment from the end of the log is below:

2017-06-28 17:11:20,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://avatarcluster/user/hadoop/.hiveJars/hive-exec-2.1.1-5f4a7e952d29bb8013edd30bbc39476ec56bc381b96b0530a6b2fbbf28e309d3.jar(->/home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/filecache/10/hive-exec-2.1.1-5f4a7e952d29bb8013edd30bbc39476ec56bc381b96b0530a6b2fbbf28e309d3.jar) transitioned from DOWNLOADING to LOCALIZED

2017-06-28 17:11:20,701 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://avatarcluster/apps/tez-0.8.5.tar.gz(->/home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/filecache/10/tez-0.8.5.tar.gz) transitioned from DOWNLOADING to LOCALIZED

2017-06-28 17:11:20,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000006 transitioned from LOCALIZING to LOCALIZED

2017-06-28 17:11:20,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000011 transitioned from LOCALIZING to LOCALIZED

2017-06-28 17:11:20,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000006 transitioned from LOCALIZED to RUNNING

2017-06-28 17:11:20,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000011 transitioned from LOCALIZED to RUNNING

2017-06-28 17:11:20,744 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/appcache/application_1496722904961_18405/container_1496722904961_18405_01_000011/default_container_executor.sh]

2017-06-28 17:11:20,744 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/appcache/application_1496722904961_18405/container_1496722904961_18405_01_000006/default_container_executor.sh]

Some log lines show up in /var/log/messages:

Jun 28 17:11:11 hdp06 systemd: Removed slice user-0.slice.
Jun 28 17:11:11 hdp06 systemd: Stopping user-0.slice.
Jun 28 17:11:33 hdp06 abrt-server: Executable '/opt/jdk1.8.0_71/bin/java' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 28 17:11:33 hdp06 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2017-06-28-17:11:20-375612' exited with 1
Jun 28 17:11:33 hdp06 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2017-06-28-17:11:20-375612'
Jun 28 17:11:43 hdp06 systemd-logind: Removed session 206241.
Jun 28 17:12:01 hdp06 systemd: Created slice user-0.slice.
Jun 28 17:12:01 hdp06 systemd: Starting user-0.slice.

It seems that Java either crashed or exited normally, but it left no log.
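
The abrt-server lines also explain why this first crash left nothing behind: /opt/jdk does not belong to any RPM package, so abrt deleted the problem directory because ProcessUnpackaged is set to 'no'. A minimal sketch for keeping such dumps, assuming the stock RHEL 7 abrt configuration file and service name:

# let abrt keep core dumps from executables that are not installed from an RPM
sudo sed -i 's/^ProcessUnpackaged.*/ProcessUnpackaged = yes/' /etc/abrt/abrt-action-save-package-data.conf
sudo systemctl restart abrtd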

4. coredump

I added the abrt-java-connector options to yarn-env.sh:

YARN_OPTS="$YARN_OPTS -agentpath:/usr/lib/abrt-java-connector/libabrt-java-connector.so -agentlib:abrt-java-connector=output=$YARN_LOG_DIR/abrt-agent.log"

and it produced a crash dump directory at /var/spool/abrt/ccpp-2017-07-03-14:24:28-63796:

-rw-r----- 1 root abrt          6 Jul  3 14:24 abrt_version
-rw-r----- 1 root abrt          4 Jul  3 14:24 analyzer
-rw-r----- 1 root abrt          6 Jul  3 14:24 architecture
-rw-r----- 1 root abrt        178 Jul  3 14:24 cgroup
-rw-r----- 1 root abrt       1974 Jul  3 14:24 cmdline
-rw-r----- 1 root abrt     380795 Jul  3 14:25 core_backtrace
-rw-r----- 1 root abrt 4887654400 Jul  3 14:24 coredump
-rw-r----- 1 root abrt          1 Jul  3 14:25 count
-rw-r----- 1 root abrt       1072 Jul  3 14:25 dso_list
-rw-r----- 1 root abrt       3318 Jul  3 14:24 environ
-rw-r----- 1 root abrt          0 Jul  3 14:25 event_log
-rw-r----- 1 root abrt         26 Jul  3 14:24 executable
-rw-r----- 1 root abrt         82 Jul  3 14:25 exploitable
-rw-r----- 1 root abrt          5 Jul  3 14:24 global_pid
-rw-r----- 1 root abrt         25 Jul  3 14:24 hostname
-rw-r----- 1 root abrt         21 Jul  3 14:24 kernel
-rw-r----- 1 root abrt         10 Jul  3 14:24 last_occurrence
-rw-r----- 1 root abrt       1323 Jul  3 14:24 limits
-rw-r----- 1 root abrt        135 Jul  3 14:25 machineid
-rw-r----- 1 root abrt      60706 Jul  3 14:24 maps
-rw-r----- 1 root abrt        243 Jul  3 14:24 open_fds
-rw-r----- 1 root abrt        495 Jul  3 14:24 os_info
-rw-r----- 1 root abrt         51 Jul  3 14:24 os_release
-rw-r----- 1 root abrt          5 Jul  3 14:24 pid
-rw-r----- 1 root abrt       1137 Jul  3 14:24 proc_pid_status
-rw-r----- 1 root abrt        149 Jul  3 14:24 pwd
-rw-r----- 1 root abrt         22 Jul  3 14:24 reason
-rw-r----- 1 root abrt          4 Jul  3 14:24 runlevel
-rw-r----- 1 root abrt   10746600 Jul  3 14:25 sosreport.tar.xz
-rw-r----- 1 root abrt         10 Jul  3 14:24 time
-rw-r----- 1 root abrt          4 Jul  3 14:24 type
-rw-r----- 1 root abrt          9 Jul  3 14:24 uid
-rw-r----- 1 root abrt          7 Jul  3 14:24 username
-rw-r----- 1 root abrt         40 Jul  3 14:25 uuid
-rw-r----- 1 root abrt     185370 Jul  3 14:25 var_log_messages
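
The files can be read directly, or the problem directory can be inspected with the stock abrt tooling; a short sketch:

# list recorded problems, then read the key files of this one
abrt-cli list
cat /var/spool/abrt/ccpp-2017-07-03-14:24:28-63796/reason
cat /var/spool/abrt/ccpp-2017-07-03-14:24:28-63796/cmdline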

файл "причины" показывает "Java, убитый SIGSEGV"

If I run "gdb /opt/jdk/bin/java coredump", it shows:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/jdk1.8.0_131/bin/java...Missing separate debuginfo for /opt/jdk1.8.0_131/bin/java
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c9/0f19ee0af98c47ccaa7181853cfd14867bc931.debug
(no debugging symbols found)...done.
[New LWP 63796]
[New LWP 62554]
[... ~200 more "[New LWP nnnnn]" lines omitted ...]
[New LWP 62586]
[New LWP 62584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/78/b091327a0bf6d146f8881f285955b4f7f2b712.debug
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/libverify.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/89/8659e6261ad966b4b638afc8e3dd214896253d.debug
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/libmanagement.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/88/c0c64eb685c329ad849281135cfe113f3812e8.debug
Core was generated by `/opt/jdk/bin/java -Dproc_nodemanager -Xmx4096m -Xms4g -Xmx4g -Xmn3g -server -XX'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2f4460bfd7 in VMError::report_and_die() ()
   from /opt/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 sssd-client-1.13.0-40.el7.x86_64
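
The top frame VMError::report_and_die() suggests HotSpot had already detected a fatal error and the core was taken while it was inside its own error-reporting path, so the interesting frames are further down the stack. After installing the debuginfo packages gdb asks for, the full picture can be extracted non-interactively; a sketch:

# backtrace of the faulting thread plus all other threads, saved to a file
gdb -batch -ex 'bt' -ex 'thread apply all bt' /opt/jdk/bin/java coredump > backtrace.txt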

So, any idea what the possible cause could be?
