asi345
asi345

Reputation: 11

CRIU - Restoring an executor Java process of Spark inside other containers

I use Kubernetes to run worker containers as nodes in my Spark setup. These worker containers run executors, which are java processes. My purpose is to checkpoint executor processes with CRIU, to be resumed later.

Now, criu restore works perfect when done inside the same container where criu dump was called. However, when I call criu restore on a different container from which criu dump was called, then I face with problems in JVM.

Firstly, I want to state that there is no error encountered in CRIU, meaning there is no error log from CRIU and it behaves like the process resumed successfully.(No Restoring FAILED message and the calling terminal blocks after restore was called). On the other hand, Java process does not behave like it supposed to and does not initiate anything, even though I see it clearly when I call ps axf: Java process is there as a child of CRIU restore process.

When I look at the logs of the executor Java process after restoring, I encounter this:

Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000002
Handler1=00007F1D89A42550 Handler2=00007F1D8979FB30 InaccessibleAddress=00007F1D4FA5EB1B
RDI=0000000000000000 RSI=00000000006CA040 RAX=00007F1D4FA5EB1B RBX=00000000006CA040
RCX=00007F1D4C3D210B RDX=000000000065D400 R8=00007F1D4C3D2075 R9=00007F1D840560F0
R10=00007F1D89BB2010 R11=00007F1D89BB2010 R12=00000000006CA040 R13=000000000065D400
R14=00007F1D4C079CB8 R15=00007F1D6E8A8900
RIP=00007F1D89AF472F GS=0000 FS=0000 RSP=00007F1D6E8A83D0
EFlags=0000000000010206 CS=0033 RBP=0000000000000002 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007F1D4FA5EB1B
Invalid UTF8 character at 00007F1CA800BF27, 62Invalid UTF8 character at 00007F1C9C005757, 6214:10:04.039 0x65f400    j9mm.141    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/build-scripts/jobs/jdk8u/jdk8u-linux-x64-openj9/workspace/build/src/openj9/runtime/gc_base/StringTable.cpp:663: ((false))
14:10:04.039 0x65e000    j9mm.141    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/build-scripts/jobs/jdk8u/jdk8u-linux-x64-openj9/workspace/build/src/openj9/runtime/gc_base/StringTable.cpp:663: ((false))
xmm0 00000000004099f8 (f: 4233720.000000, d: 2.091736e-317)
xmm1 3736373520202020 (f: 538976256.000000, d: 9.961844e-43)
xmm2 0000000000ff0000 (f: 16711680.000000, d: 8.256667e-317)
xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4 00000000000000ff (f: 255.000000, d: 1.259867e-321)
xmm5 6176616a5f617661 (f: 1600222848.000000, d: 3.146502e+161)
xmm6 6f6e5f7463656a62 (f: 1667590784.000000, d: 5.756156e+228)
xmm7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8 0031000011000f00 (f: 285216512.000000, d: 9.456564e-308)
xmm9 ffffff0000000000 (f: 0.000000, d: -nan)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=<path to jdk>/jdk/jre/lib/amd64/default/libj9vm29.so
Module_base_address=00007F1D89A03000
Target=2_90_20220128_306 (Linux 5.15.0-106-generic)
CPU=amd64 (32 logical CPUs) (0x178a97e000 RAM)
----------- Stack Backtrace -----------
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F1D89A42550 Handler2=00007F1D8979FB30 InaccessibleAddress=0000000000000010
RDI=00007F1D4C1B87B2 RSI=00007F1D4C1B87B2 RAX=0000000000000000 RBX=0000000000000000
RCX=00007F1CAC001203 RDX=00000000000022FF R8=0000000000000000 R9=0000000000000000
R10=000000000000000A R11=00007F1D4C1674A4 R12=0000000000030750 R13=00000000001AEE00
R14=00007F1D4C1670EC R15=00007F1D6C9A3900
RIP=00007F1D89ACCCF4 GS=0000 FS=0000 RSP=00007F1D6C9A33D0
EFlags=0000000000010206 CS=0033 RBP=00007F1D4C1B87B1 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000010
(0x00007F1D89AF472F [libj9vm29.so+0xf172f])
(0x00007F1D89ACC8CD [libj9vm29.so+0xc98cd])
(0x00007F1D89B3FF42 [libj9vm29.so+0x13cf42])
---------------------------------------
xmm0 00000000ffffffff (f: 4294967296.000000, d: 2.121996e-314)
xmm1 6c2f6176616a4c28 (f: 1634356224.000000, d: 1.320535e+213)
xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3 3ef0b576d163fab9 (f: 3512990464.000000, d: 1.593480e-05)
xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5 4547415353454d5f (f: 1397050752.000000, d: 5.622757e+25)
xmm6 2e53555f6e652f65 (f: 1852125056.000000, d: 1.555004e-85)
xmm7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8 0000000000078e40 (f: 495168.000000, d: 2.446455e-318)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 2020000020202000 (f: 538976256.000000, d: 5.966673e-154)
xmm11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=<path to jdk>/jdk/jre/lib/amd64/default/libj9vm29.so
Module_base_address=00007F1D89A03000
Target=2_90_20220128_306 (Linux 5.15.0-106-generic)
CPU=amd64 (32 logical CPUs) (0x178a97e000 RAM)
----------- Stack Backtrace -----------
(0x00007F1D89ACCCF4 [libj9vm29.so+0xc9cf4])
(0x00007F1D89ACC8CD [libj9vm29.so+0xc98cd])
(0x00007F1D89B3FF42 [libj9vm29.so+0x13cf42])
---------------------------------------
JVMDUMP039I Processing dump event "traceassert", detail "" at 2024/07/02 14:10:04 - please wait.
JVMDUMP039I Processing dump event "traceassert", detail "" at 2024/07/02 14:10:04 - please wait.
JVMDUMP039I Processing dump event "gpf", detail "" at 2024/07/02 14:10:04 - please wait.
JVMDUMP039I Processing dump event "gpf", detail "" at 2024/07/02 14:10:04 - please wait.
JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20240702.141004.231.0004.txt' in response to an event
JVMDUMP012E Error in Java dump: /tmp/javacore.20240702.141004.231.0004.txt

I also extracted Java dump that is mentioned above. Since it is a bit long to paste all here, I will just share the section that the problem is mentioned:

...
2HKEVENTID     4
3HKCALLCOUNT     755
3HKTOTALTIME     693us
3HKLAST          Last Callback
4HKCALLSITE        /home/jenkins/workspace/build-scripts/jobs/jdk8u/jdk8u-linux-x64-openj9/workspace/build/src/openj9/runtime/jcl/common/mgmtinit.c:196
4HKSTARTTIME       Start Time: 2024-07-02T14:08:44.274
4HKDURATION        Duration: 1us
3HKLONGST        Longest Callback
4HKCALLSITE        /home/jenkins/workspace/build-scripts/jobs/jdk8u/jdk8u-linux-x64-openj9/workspace/build/src/openj9/runtime/jcl/common/mgmtinit.c:196
4HKSTARTTIME       Start Time: 2024-07-02T14:03:54.272
4HKDURATION        Duration: 13us
NULL
NULL           ------------------------------------------------------------------------
1INTERNAL      In-flight data encountered. Output may be missing or incomplete.
0SECTION       CLASSES subcomponent dump routine
NULL           =================================
1CLTEXTCLLOS    Classloader summaries
1CLTEXTCLLSS        12345678: 1=primordial,2=extension,3=shareable,4=middleware,5=system,6=trusted,7=application,8=delegating
2CLTEXTCLLOADER     p---st-- Loader *System*(0x00000000F8026A58)
3CLNMBRLOADEDLIB        Number of loaded libraries 6
3CLNMBRLOADEDCL         Number of loaded classes 1538
3CLNMBRSHAREDCL         Number of shared classes 1409
2CLTEXTCLLOADER     ----st-- Loader java/lang/InternalAnonymousClassLoader(0x00000000F8026DE0), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 58
3CLNMBRSHAREDCL         Number of shared classes 58
2CLTEXTCLLOADER     -----ta- Loader sun/misc/Launcher$AppClassLoader(0x00000000F8096D90), Parent sun/misc/Launcher$ExtClassLoader(0x00000000F8032188)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1472
2CLTEXTCLLOADER     -x--st-- Loader sun/misc/Launcher$ExtClassLoader(0x00000000F8032188), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 0
2CLTEXTCLLOADER     ----st-- Loader sun/reflect/DelegatingClassLoader(0x00000000F9C15468), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
2CLTEXTCLLOADER     ----st-- Loader sun/reflect/DelegatingClassLoader(0x00000000FB5A8D38), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
2CLTEXTCLLOADER     ----st-- Loader sun/reflect/DelegatingClassLoader(0x00000000FB6AAEC0), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
2CLTEXTCLLOADER     -----t-- Loader sun/reflect/DelegatingClassLoader(0x00000000FB930898), Parent sun/misc/Launcher$AppClassLoader(0x00000000F8096D90)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
2CLTEXTCLLOADER     -----t-- Loader sun/reflect/DelegatingClassLoader(0x00000000FB933250), Parent sun/misc/Launcher$AppClassLoader(0x00000000F8096D90)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
2CLTEXTCLLOADER     ----st-- Loader sun/reflect/DelegatingClassLoader(0x00000000FC429E88), Parent *none*(0x0000000000000000)
3CLNMBRLOADEDLIB        Number of loaded libraries 0
3CLNMBRLOADEDCL         Number of loaded classes 1
1CLTEXTCLLIB    ClassLoader loaded libraries
2CLTEXTCLLIB        Loader *System*(0x00000000F8026A58)
3CLTEXTLIB              <path to jdk>/jdk/jre/lib/amd64/default/jclse29
3CLTEXTLIB              <path to jdk>/jdk/jre/lib/amd64/java
3CLTEXTLIB              <path to jdk>/jdk/jre/lib/amd64/zip
3CLTEXTLIB              <path to jdk>/jre/lib/amd64/default/j9jit29
3CLTEXTLIB              <path to jdk>/jre/lib/amd64/nio
3CLTEXTLIB              <path to jdk>/jdk/jre/lib/amd64/net
1CLTEXTCLLOD    ClassLoader loaded classes
2CLTEXTCLLOAD       Loader *System*(0x00000000F8026A58)
3CLTEXTCLASS            [L(0x00000000006E6300)
3CLTEXTCLASS            [unknown]
2CLTEXTCLLOAD       Loader java/lang/InternalAnonymousClassLoader(0x00000000F8026DE0)
3CLTEXTCLASS            1INTERNAL      In-flight data encountered. Output may be missing or incomplete.
0SECTION       Javadump End section
NULL           ---------------------- END OF DUMP -------------------------------------

Does anyone have an idea what this in-flight data means? What component was failed to be captured by CRIU?

For clarity, I also wanted to mention that across different restores, it always fails inside CLASSES subcomponent dump routine, although the exact line of the fail is subject to change.

Note 1 : My CRIU commands look like this : criu dump -t <pid of executor> -D <folder> --file-locks --tcp-established, criu restore -D <folder> --file-locks --tcp-established. I always restore the iptables state, which was saved right after criu dump, in the container where criu restore is performed just before calling criu restore. Additionally, Java process is started with setsid, thus --shell-job option is not used.

Note 2 : I have intentionally hidden jdk path prefixes, but that has nothing to do with the problem I encounter.

I also tried running criu with a variety of options : --tcp-close, --restore-detached, --restore-sibling, --evasive-devices. For a possible pid collision or mismatch problem, I also tried running with criu-ns. None of them made any difference.

I was expecting Java executor process to run normally and connect to the system when I saw that criu restore was successful.

Upvotes: 1

Views: 68

Answers (0)

Related Questions