gwbcho
gwbcho

Reputation: 1

Need assistance with hanging large MySQL transaction

I'm a bit new to this sort of thing so I apologize in advance for my ignorance. My team and I are writing an unsupervised nonparametric method to cluster streamed elements with feature vectors being stored in a MySQL 5.7 database for later recall. The feature vectors are broken up into several components by the algorithm (the detailed reasoning why we do this is inconsequential for this question) resulting in multiple vectors stored in the database in a single datatable which describes the inputted element. These vectors can vary in size but I believe the largest one has a dimension of around 1x60000. We store each feature vector in the same datatable to preserve 4th Normal Form in our database design.

The issue is, when we automate the inserting of this vector we run into a problem with the transactions where the locks persist for an unusual amount of time. I believe it could be a deadlock issue but can't seem to figure out where the deadlock could be. The operations shouldn't block each other indefinitely (operational order: insert general information to info datatable -> insert component 1 feature vectors into info datatable -> insert component 2 feature vectors into info datatable -> ...). Note that even if one of the transactions block it should ~eventually~ resolve and the next transaction should begin. This leads me to my next hypothesis, it would seem that each individual transaction is too large resulting in the tables locking for a large amount of time. I've heard that this can cause issues even if there is no logical reason for a deadlock to occur. The following is the result from me running SHOW ENGINE INNODB STATUS; in my database docker container. Any help or advice would be appreciated.

Other information: we're using python3.7 with numpy. Each feature vector is a numpy array object that is stringified and stored as a LONGBLOB in the database. The information is stored sequentially as a result of the underlying algorithm and also to reduce the load on the db servers (we wanted to avoid this issue to begin with). We've thought about removing the vector storage mechanism entirely, but, it makes numerous operations and algorithms later down the line significantly faster -- not to mention easier to write -- to simply recall this information as opposed to reconstructing the vector from raw database information about the inputted item's feature set (imagine recalculating the one hot vector for your training set at each batch... not ideal).

------------
TRANSACTIONS
------------
Trx id counter 5848
Purge done for trx's n:o < 5800 undo n:o < 0 state: running but idle
History list length 66
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421274805708992, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805702552, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805704392, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805703472, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805701632, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805699792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805698872, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805694272, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805693352, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805692432, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805688752, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805687832, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805685072, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805684152, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805683232, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805682312, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805680472, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805678632, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805677712, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805676792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805675872, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805686912, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805697952, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805697032, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805696112, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805695192, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805691512, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805690592, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805689672, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805685992, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421274805681392, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 5845, ACTIVE 1960 sec
6 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 18460, OS thread handle 139799096174336, query id 110111 172.18.0.17 root
---TRANSACTION 5839, ACTIVE 3374 sec
1 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 18561, OS thread handle 139799094281984, query id 110572 172.18.0.20 root
Trx read view will not see trx with id >= 5840, sees < 5799
---TRANSACTION 5838, ACTIVE 3476 sec
1 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 18540, OS thread handle 139799094552320, query id 108421 172.18.0.20 root
Trx read view will not see trx with id >= 5839, sees < 5799
---TRANSACTION 5837, ACTIVE 3577 sec
1 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 18519, OS thread handle 139799095092992, query id 108310 172.18.0.20 root
Trx read view will not see trx with id >= 5838, sees < 5799
---TRANSACTION 5835, ACTIVE 3628 sec
1 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 18497, OS thread handle 139799094822656, query id 108175 172.18.0.20 root
Trx read view will not see trx with id >= 5835, sees < 5799
---TRANSACTION 5802, ACTIVE 3832 sec
1 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 11, OS thread handle 139799379732224, query id 107906 172.18.0.20 root
Trx read view will not see trx with id >= 5799, sees < 5796
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
 ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
501 OS file reads, 8143 OS file writes, 4776 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 12 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 44613787
Log flushed up to   44613787
Pages flushed up to 44613787
Last checkpoint at  44613778
0 pending log flushes, 0 pending chkp writes
3412 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137428992
Dictionary memory allocated 924661
Buffer pool size   8191
Free buffers       5612
Database pages     2556
Old database pages 923
Modified db pages  0
Pending reads      0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 410, created 2146, written 3707
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 2556, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
5 read views open inside InnoDB
Process ID=1, Main thread ID=139799528855296, state: sleeping
Number of rows inserted 187012, updated 52, deleted 4, read 102952428
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s

The following is the error we get from our automated vector insert system.

score-service_1                 | ERROR:root:An error occurred.
score-service_1                 | Traceback (most recent call last):
score-service_1                 |   File "/usr/src/app/src/db/mysql.py", line 103, in execute
score-service_1                 |     self.cursor.execute(sql, parameters)
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 250, in execute
score-service_1                 |     self.errorhandler(self, exc, value)
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
score-service_1                 |     raise errorvalue
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 247, in execute
score-service_1                 |     res = self._query(query)
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 411, in _query
score-service_1                 |     rowcount = self._do_query(q)
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 374, in _do_query
score-service_1                 |     db.query(q)
score-service_1                 |   File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 277, in query
score-service_1                 |     _mysql.connection.query(self, query)
score-service_1                 | _mysql_exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

There don't appear to be any table locks. Essentially, my question is why is this happening? Is my guess as to why I'm encountering hanging transactions correct? Are there any diagnostic tools or methods I could use to try and assess the problem further? I am personally still a bit new with SQL.

Upvotes: 0

Views: 570

Answers (1)

Rick James
Rick James

Reputation: 142538

No BEGIN? Is it running with autocommit=OFF? Is the entire program one giant transaction? No wonder it is timing out.

A "transaction" should be a short sequence of DB actions that need to be "atomically" run. That is, all succeed, or all fail, no in-between.

Normally a transaction is explicitly started with something like BEGIN or START TRANSACTION (interfaces have different variants).

When you include a lot of time-consuming client code inside the transaction, then competing transactions (from other clients) may be stalled waiting for access to whatever is locked. Try to avoid this.

Some design patterns work like this:

1. get a thing
2. do a lot of time-consuming processing on that thing
3. release that thing

While it is tempting to have the BEGIN in step 1 and the COMMIT in step 3. Step 2 says that is not wise. Instead

1a. BEGIN
1b. Find a thing to work on
1c. store (`UPDATE`) an indication that this client has grabbed the thing
1d. COMMIT
2. do a lot of time-consuming processing on that thing
3a. BEGIN
3b. Release (`UPDATE`) the thing
3c. COMMIT

This pattern is much more scalable.

It may be possible to collapse Steps 1a-1d into a single auto-committed statement. Ditto for Steps 3a-3c.

Thus goes the scaling of queued tasks and keeping the workers out of each other's hair. And going a long way toward avoiding "lock wait timeout".

Upvotes: 1

Related Questions