Antony

Reputation: 443

Why does connection pooling slow down SQL queries in Flask?

Connection pooling is supposed to improve the throughput of Postgres, or at least that is what everyone says when you google what pooling is and what its benefits are. However, whenever I experiment with connection pooling in Flask, the result is always that it is significantly slower than just opening a connection and a cursor at the beginning of the file and never closing them. If my web app is constantly getting requests from users, why do we even close the connection and the cursor? Isn't it better to create a connection and a cursor once, and then whenever we get a request (whether a GET or a POST) simply use the existing cursor and connection? Am I missing something here?! Here is the timing for each approach, and below is the code I ran to benchmark each one.

it took 16.537524700164795 seconds to finish 100000 queries with one database connection that we opened once at the beginning of the Flask app and never closed

it took 38.07477355003357 seconds to finish 100000 queries with the psycopg2 pooling approach

it took 52.307902574539185 seconds to finish 100000 queries with the pgbouncer pooling approach

Also, here is a video of the test being run with the results, in case it is of any help: https://youtu.be/V2kzKApDs8Y

The Flask app that I used to benchmark each approach is:

import psycopg2
import time
from psycopg2 import pool
from flask import Flask

app = Flask(__name__)

# Pool of direct connections to Postgres
connection_pool = pool.SimpleConnectionPool(1, 50, host="localhost", database="test", user="postgres", password="test", port="5432")

# Single connection and cursor opened once and reused for every request
connection = psycopg2.connect(host="127.0.0.1", database="test", user="postgres", password="test", port="5432")
cursor = connection.cursor()

# Pool of connections that go through PgBouncer on port 6432
pgbouncerconnection_pool = pool.SimpleConnectionPool(1, 50, host="127.0.0.1", database="test", user="postgres", password="test", port="6432")

@app.route("/poolingapproach")
def zero():
    start = time.time()
    for x in range(100000):
        with connection_pool.getconn() as connectionp:
            with connectionp.cursor() as cursorp:
                cursorp.execute("SELECT *  from tb1 where id = %s" , [x%100])
                result = cursorp.fetchone()
                connection_pool.putconn(connectionp)
    y = "it took " +  str(time.time() - start) + " seconds to finish 100000 queries with the pooling approach"
    return str(y) , 200

@app.route("/pgbouncerpooling")
def one():
        start = time.time()
        for x in range(100000):
                with pgbouncerconnection_pool.getconn() as pgbouncer_connection:
                        with pgbouncer_connection.cursor() as pgbouncer_cursor:
                                pgbouncer_cursor.execute("SELECT *  from tb1 where id = %s" , [x%100])
                                result = pgbouncer_cursor.fetchone()
                                pgbouncerconnection_pool.putconn(pgbouncer_connection)
        a = "it took " +  str(time.time() - start) + " seconds to finish 100000 queries with pgbouncer pooling approach"
        return str(a) , 200



@app.route("/oneconnection_at_the_begining")
def two():
    start = time.time()
    for x in range(100000):
        cursor.execute("SELECT * from tb1 where id = %s",[x%100])
        result = cursor.fetchone()
    end = time.time()
    x = 'it took ' + str(end - start)+ ' seconds to finish 100000 queries with one database connection that we don\'t close'
    return str(x) , 200
if  __name__=="__main__":
    app.run()

Upvotes: 0

Views: 983

Answers (1)

shikida

Reputation: 505

I am not sure how you're testing, but the idea of a connection pool is basically to address two things:

  1. you have way more users trying to connect to the database than it can directly handle (e.g. your DB allows 100 connections and there are 1000 users trying to use it at the same time), and
  2. it saves the time you would otherwise spend opening and closing connections, because the pooled connections are already open.

So I believe your tests should focus on these two situations.
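
For reference, the per-request pattern a pool is designed for looks roughly like the sketch below. This is only an illustration, not your exact benchmark: the route, table name and connection parameters are placeholders, and it uses ThreadedConnectionPool, the thread-safe variant of the SimpleConnectionPool you already have. Each request borrows one connection, runs its query, and hands the connection back, so a small, bounded number of open connections can serve many simultaneous requests.

from flask import Flask
from psycopg2 import pool

app = Flask(__name__)

# Thread-safe pool shared by all request handlers (placeholder credentials)
connection_pool = pool.ThreadedConnectionPool(1, 20, host="localhost", database="test", user="postgres", password="test", port="5432")

@app.route("/row/<int:row_id>")
def get_row(row_id):
    conn = connection_pool.getconn()   # borrow a connection for this request only
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * from tb1 where id = %s", [row_id])
            row = cur.fetchone()
    finally:
        connection_pool.putconn(conn)  # always hand it back, even if the query fails
    return str(row), 200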

If your test is just one user requesting a lot of queries, the connection pool will just be overhead, because it opens and manages extra connections that you simply don't need.

If your test always returns the same data, your DBMS will probably cache the results and the query plan, which will skew the numbers. In fact, in order to scale things, you could even benefit from a secondary cache such as Elasticsearch.

Now, if you want to perform a realistic test, you should mix read and write operations, add some randomness to the queries (forcing the DB not to cache results or query plans), and try incremental loads, so you can see how the system behaves each time you add more simultaneous clients.
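
A minimal sketch of such an incremental load test, assuming the same tb1 table and placeholder connection details as in your question, could look like the code below: each pass starts more client threads, every client queries a random id (so cached results and plans help less), and the elapsed time is printed per concurrency level. Adding writes, and running the clients on a separate machine as mentioned next, would make it more realistic.

import random
import threading
import time
from psycopg2 import pool

# Thread-safe pool (placeholder credentials, same test table as in the question)
connection_pool = pool.ThreadedConnectionPool(1, 50, host="localhost", database="test", user="postgres", password="test", port="5432")

def client(queries_per_client):
    for _ in range(queries_per_client):
        conn = connection_pool.getconn()
        try:
            with conn.cursor() as cur:
                # random id so the server cannot just replay one cached result
                cur.execute("SELECT * from tb1 where id = %s", [random.randint(0, 99)])
                cur.fetchone()
        finally:
            connection_pool.putconn(conn)

if __name__ == "__main__":
    for n_clients in (1, 5, 10, 25, 50):   # incremental load: more simultaneous clients each pass
        start = time.time()
        threads = [threading.Thread(target=client, args=(2000,)) for _ in range(n_clients)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(f"{n_clients} clients: {time.time() - start:.2f} seconds")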

And because these clients also add CPU load to the test, you may also consider running them on a different machine from the one that serves your DB, to keep the results fair.

Upvotes: 1
