wilfo

Reputation: 715

SQLAlchemy connection hangs on AWS MySQL RDS reboot with failover

We have a Python server which uses SQLAlchemy to read/write data from an AWS MySQL MultiAZ RDS instance.

We're experiencing a behavior we'd like to avoid: whenever we trigger a failover reboot, a connection that was already open hangs indefinitely on its next statement. While this is expected according to the AWS documentation, we would expect the Python MySQL connector to be able to cope with this situation.

The closest case we've found on the web is this Google Groups thread, which discusses the issue and offers a solution for a Postgres RDS.

For example, the script below (adapted from the above-mentioned Google Groups thread) hangs indefinitely when we initiate a failover reboot.

from datetime import datetime
from time import time, sleep
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.scoping import scoped_session
from sqlalchemy.ext.declarative import declarative_base

import logging

current_milli_time = lambda: int(round(time() * 1000))
Base = declarative_base()

logging.basicConfig(format='%(asctime)s %(filename)s %(lineno)s %(process)d %(levelname)s: %(message)s', level="INFO")

class Message(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    body = Column(String(450), nullable=False)

engine = create_engine('mysql://<username>:<password>@<db_host>/<db_name>', echo=False, pool_recycle=1800)
session_maker = scoped_session(sessionmaker(bind=engine, autocommit=False, autoflush=False))
session = session_maker()

while True:
    try:
        ids = ''
        start = current_milli_time()
        for msg in session.query(Message).order_by(Message.id.desc()).limit(5):
            ids += str(msg.id) + ', '
        logging.info('({!s}) (took {!s} ms) fetched ids: {!s}'.format(datetime.now().time().isoformat(), current_milli_time() - start, ids))

        start = current_milli_time()
        m = Message()
        m.body = 'some text'
        session.add(m)
        session.commit()
        logging.info('({!s}) (took {!s} ms) inserted new message'.format(datetime.now().time().isoformat(), current_milli_time() - start))

    except Exception as e:
        logging.exception(e)
        session.rollback()
    finally:
        session_maker.remove()

    sleep(0.25)

We've tried playing with the connection timeouts, but the issue seems to be related to an already-open connection that simply hangs once AWS switches to the failover instance.
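One direction we're considering (a sketch, not something we've verified against an actual failover) is enabling SQLAlchemy's `pool_pre_ping` together with driver-level socket timeouts, here shown with the PyMySQL driver's `connect_timeout`/`read_timeout`/`write_timeout` parameters. The host and credential placeholders are the same as above; the timeout values are arbitrary examples.

```python
from sqlalchemy import create_engine

# pool_pre_ping (SQLAlchemy 1.2+) issues a lightweight ping on each
# connection checkout, so a connection killed by the failover is
# replaced instead of being handed back to the application.
# The PyMySQL timeouts (in seconds) make a socket that went dead
# mid-statement raise an error instead of blocking forever.
engine = create_engine(
    'mysql+pymysql://<username>:<password>@<db_host>/<db_name>',
    echo=False,
    pool_recycle=1800,
    pool_pre_ping=True,
    connect_args={
        'connect_timeout': 10,
        'read_timeout': 30,
        'write_timeout': 30,
    },
)
```

The pre-ping only runs at checkout time, so the read/write timeouts are still needed to cover a connection that dies while a statement is in flight.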

Our question is: has anyone encountered this issue, or are there possible directions worth checking?

Upvotes: 3

Views: 2375

Answers (1)

mootmoot

Reputation: 13166

IMHO, using the SQL connector's timeout to handle a switchover is like black magic. Each connector acts differently, and the behavior is difficult to diagnose.

If you read @univerio's comment again: AWS reassigns a new IP address to the SAME RDS endpoint name. While the switch is in progress, the endpoint name and the old IP address are still in your server instance's DNS cache. So this is a DNS caching issue, and that's why AWS asks you to "clean up....".

Unless you restart SQLAlchemy so it resolves the DNS name again, there is no way for the session to know that something happened and switch over dynamically. Worse, the issue can also occur in the connector used by SQLAlchemy.

IMHO, it isn't worth the effort to deal with the switchover inside the code. I would just subscribe to an AWS service such as Lambda that can act on switchover events and trigger the app server to restart its connections, which should then pick up the new IP address.
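The "restart the connections" part does not have to mean restarting the whole process: SQLAlchemy's `Engine.dispose()` closes the pooled connections, so the next checkout opens fresh sockets and resolves the endpoint name again. A minimal sketch, assuming you have some external trigger (the handler name here is hypothetical):

```python
from sqlalchemy import create_engine

engine = create_engine('mysql://<username>:<password>@<db_host>/<db_name>')

def on_failover_event():
    # Call this from whatever receives the RDS failover notification
    # (e.g. an endpoint your Lambda hits). dispose() discards the
    # current connection pool; new connections re-resolve the RDS
    # endpoint and therefore pick up the new primary's IP address.
    engine.dispose()
```

Sessions bound to the engine will then get fresh connections on their next use; any statement already hanging on a dead socket still needs a driver-level timeout to fail.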

Upvotes: 4
