grepe

Reputation: 1977

Dockerized MongoDB keeps crashing during long writes to a capped collection (SEGFAULT)

We run a large MongoDB instance (roughly 500 GB of data) containing logs (non-critical data, JSON documents with quite variable structure), and we need to regularly delete the oldest records. We decided to move the data to a capped collection with a fixed size. So we set up a new mongo instance (MongoDB version 3.2.14, for compatibility reasons), created a collection and indexes in it, and started a job to copy the documents from the old mongo into the new one in chronological order.
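For context, creating a capped collection of a fixed size in pymongo looks roughly like this (a minimal sketch; the size and index below are illustrative, not our actual values):

import pymongo

client = pymongo.MongoClient('localhost')
db = client['db_name']
# size is the maximum collection size in bytes; once the limit is reached,
# the oldest documents are evicted automatically
db.create_collection('collection_name', capped=True, size=500 * 1024 ** 3)
db['collection_name'].create_index([('collector_tstamp', pymongo.ASCENDING)])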

The script to copy the data looks like this:

import pymongo
from pymongo.errors import BulkWriteError
import traceback

src_mongo = pymongo.MongoClient('old_mongo_ip')
src = src_mongo['db_name']['collection_name']

dst_mongo = pymongo.MongoClient('localhost')
dst = dst_mongo['db_name']['collection_name']

bulk = dst.initialize_unordered_bulk_op()
count = 0
total = 0
for doc in src.find().sort([("collector_tstamp",1)]):
    bulk.insert(doc)
    count += 1
    if (count > 1000):
        try:
            result = bulk.execute()
            total += result['nInserted']
        except BulkWriteError as err:
            traceback.print_exc()
            total += err.details['nInserted']
        finally:
            bulk = dst.initialize_unordered_bulk_op()
            count = 0
    print(str(total)+"\r",end="")
if count > 0:
    try:
        # flush the remaining partial batch
        result = bulk.execute()
        total += result['nInserted']
    except BulkWriteError as err:
        traceback.print_exc()
        total += err.details['nInserted']
print(str(total))
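(If initialize_unordered_bulk_op is unavailable in your pymongo version, the same batching can be expressed with Collection.bulk_write. A minimal sketch, reusing src, dst and the imports from above:)

from pymongo import InsertOne

ops = []
total = 0
for doc in src.find().sort([("collector_tstamp", 1)]):
    ops.append(InsertOne(doc))
    if len(ops) >= 1000:
        try:
            # unordered bulk insert of the current batch
            total += dst.bulk_write(ops, ordered=False).inserted_count
        except BulkWriteError as err:
            traceback.print_exc()
            total += err.details['nInserted']
        ops = []
    print(str(total) + "\r", end="")
if ops:
    try:
        total += dst.bulk_write(ops, ordered=False).inserted_count
    except BulkWriteError as err:
        traceback.print_exc()
        total += err.details['nInserted']
print(total)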

The problem is that the job takes a long time (not surprising, given this setup) and the new mongo keeps crashing with a SEGFAULT after a few hours of copying.

The new mongo runs in a docker container on an EC2 instance (m4.large, the same instance that runs the script above) and stores its data on an EBS volume (GP2 SSD). There are no hints as to the reason for the crash, other than the stack trace in the mongod.log file:

2017-07-22T01:50:34.452Z I COMMAND  [conn5] command db_name.collection_name command: insert { insert: "collection_name", ordered: false, documents: 1000 } ninserted:1000 keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } }, Metadata: { acquireCount: { w: 1000, W: 1000 } } } protocol:op_query 318ms
2017-07-22T01:50:34.930Z F -        [thread1] Invalid access at address: 0x78
2017-07-22T01:50:34.994Z F -        [thread1] Got signal: 11 (Segmentation fault).

 0x154e4f2 0x154d499 0x154de77 0x7f22f6862390 0x7f22f685c4c0 0x1bbe09b 0x1bc2305 0x1c10f7a 0x1c0bd53 0x1c0c0d7 0x1c0d9e0 0x1c74406 0x7f22f68586ba 0x7f22f658e82d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"114E4F2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"114D499"},{"b":"400000","o":"114DE77"},{"b":"7F22F6851000","o":"11390"},{"b":"7F22F6851000","o":"B4C0","s":"__pthread_mutex_unlock"},{"b":"400000","o":"17BE09B"},{"b":"400000","o":"17C2305","s":"__wt_split_multi"},{"b":"400000","o":"1810F7A","s":"__wt_evict"},{"b":"400000","o":"180BD53"},{"b":"400000","o":"180C0D7"},{"b":"400000","o":"180D9E0","s":"__wt_evict_thread_run"},{"b":"400000","o":"1874406","s":"__wt_thread_run"},{"b":"7F22F6851000","o":"76BA"},{"b":"7F22F6488000","o":"10682D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.14", "gitVersion" : "92f6668a768ebf294bd4f494c50f48459198e6a3", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1022-aws", "version" : "#31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "B04D4C2514E2C891B5791D71A8F4246ECADF157D" }, { "b" : "7FFF43146000", "elfType" : 3, "buildId" : "1AD367D8FF756A82AA298AB1CC9CD893BB5C997C" }, { "b" : "7F22F77DD000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "675F454AD6FD0B6CA2E41127C7B98079DA37F7B6" }, { "b" : "7F22F7399000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "2DA08A7E5BF610030DD33B70DB951399626B7496" }, { "b" : "7F22F7191000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "0DBB8C21FC5D977098CA718BA2BFD6C4C21172E9" }, { "b" : "7F22F6F8D000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C0C5B7F18348654040534B050B110D32A19EA38D" }, { "b" : "7F22F6C84000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "05451CB4D66C321691F64F253880B7CE5B8812A6" }, { "b" : "7F22F6A6E000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F22F6851000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "84538E3C6CFCD5D4E3C0D2B6C3373F802915A498" }, { "b" : "7F22F6488000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CBFA941A8EB7A11E4F90E81B66FCD5A820995D7C" }, { "b" : "7F22F7A46000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "A7D5A820B802049276B1FC26C8E845A3E194EB6B" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x154e4f2]
 mongod(+0x114D499) [0x154d499]
 mongod(+0x114DE77) [0x154de77]
 libpthread.so.0(+0x11390) [0x7f22f6862390]
 libpthread.so.0(__pthread_mutex_unlock+0x0) [0x7f22f685c4c0]
 mongod(+0x17BE09B) [0x1bbe09b]
 mongod(__wt_split_multi+0x85) [0x1bc2305]
 mongod(__wt_evict+0x8FA) [0x1c10f7a]
 mongod(+0x180BD53) [0x1c0bd53]
 mongod(+0x180C0D7) [0x1c0c0d7]
 mongod(__wt_evict_thread_run+0xC0) [0x1c0d9e0]
 mongod(__wt_thread_run+0x16) [0x1c74406]
 libpthread.so.0(+0x76BA) [0x7f22f68586ba]
 libc.so.6(clone+0x6D) [0x7f22f658e82d]
-----  END BACKTRACE  -----

I tried searching around, but I could not find any possible solution... has anyone encountered a similar problem, and did you find out why it was happening?

Upvotes: 0

Views: 2343

Answers (1)

helmbert

Reputation: 38024

This looks very much like the known MongoDB bug SERVER-29850, which describes this exact behaviour and was fixed in 3.2.15:

A bug in the algorithm to do page splitting in the WiredTiger storage engine may trigger a segmentation fault, causing a node to shut down defensively to protect user data. [...]

The bug manifests itself with a message in the logs similar to the one below:

2017-06-23T19:03:29.043+0000 F -        [thread1] Invalid access at address: 0x78
2017-06-23T19:03:29.073+0000 F -        [thread1] Got signal: 11 (Segmentation fault).

----- BEGIN BACKTRACE -----
[...]
 mongod(+0x160C2BB) [0x1a0c2bb]
 mongod(__wt_split_multi+0x85) [0x1a105e5]
 mongod(__wt_evict+0xA55) [0x1a5eac5]
[...]
-----  END BACKTRACE  -----

My suggestion would be to upgrade from MongoDB version 3.2.14 to 3.2.15.
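Once the container is running the newer binary, you can verify the server version from the copy script's connection (a small check, assuming the same localhost client as in the question):

import pymongo

client = pymongo.MongoClient('localhost')
# server_info() returns the server's buildInfo document, including its version string
print(client.server_info()['version'])  # should print 3.2.15 after the upgrade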

Upvotes: 2
