Reputation: 345
It seems that loading data from a CSV file is faster than from SQL (PostgreSQL) with Pandas. (I have an SSD.)
Here is my test code:
import time

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

start = time.time()
df = pd.read_csv('foo.csv')
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))

engine = create_engine('postgresql://user:password@host:port/schema')

start = time.time()
df = pd.read_sql_query("select * from mytable", engine)
df *= 3
duration = time.time() - start
print('{0}s'.format(duration))
The foo.csv file and the database table contain the same data (4 columns, 100 000 rows of random integers).
CSV takes 0.05s
SQL takes 0.5s
Do you think it's normal that CSV is 10 times faster than SQL? I'm wondering if I'm missing something here...
Upvotes: 13
Views: 15394
Reputation: 5591
While Steven G's explanation of the process is fundamentally the answer to your question, and Simon G's COPY solution is the most efficient one I've been able to find, I decided to look into your question in more depth and actually measure the different aspects relevant to it.
At https://github.com/mikaelhg/pandas-pg-csv-speed-poc there is a project which contains pytest benchmarks for the various alternative solutions.
The CSV for this test is an order of magnitude larger than the one in the question, with the shape of (3742616, 6), just to make sure that there is less of a chance of various buffers being just the right size to skew the results.
Thanks to the Finnish Traffic Safety Bureau Trafi's open data initiative for providing the test data.
As for the PostgreSQL installation, it's inside the canonical Docker container and was started with upped shared_buffers and work_mem values, with the data files stored under the host machine's /dev/shm mount point in order to negate actual disk I/O. Its UNIX socket connection point is also similarly exposed.
version: '3'

services:
  db:
    image: 'postgres:10-alpine'
    command: "postgres -c 'shared_buffers=512MB' -c 'temp_buffers=80MB' -c 'work_mem=256MB'"
    ports:
      - '5432:5432'
    volumes:
      - '/dev/shm/pgtest/data:/var/lib/postgresql/data'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: test

  test:
    image: pandas_speed_poc:temp
    build:
      context: .
      dockerfile: Dockerfile.test-runner
    volumes:
      - '.:/app'
      - '/dev/shm/pgtest/run:/var/run/postgresql'
    working_dir: '/app'
    user: '1000'
The test runner is a simple Ubuntu 18.04 container:
FROM ubuntu:18.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
    apt-get -y -qq install python3-dev python3-pip python3-psycopg2 \
        build-essential \
        bash less nano wait-for-it
RUN pip3 install sqlalchemy numpy pandas \
    pytest pytest-benchmark
WORKDIR /app
CMD wait-for-it db:5432 -- /bin/bash -c "trap : TERM INT; sleep infinity & wait"
The actual benchmark is a Python 3 unittest written for pytest-benchmark:
#!/usr/bin/python3
from sqlalchemy import create_engine
import psycopg2
import psycopg2.extensions
import pandas as pd
import numpy as np
import io
import time
import gzip
import unittest
import pytest
DATA_FILE = 'data/licenses.csv.gz'
DROP_TABLE = "DROP TABLE IF EXISTS licenses"
CREATE_TABLE = """
CREATE TABLE licenses (
a VARCHAR(16),
b CHAR(3),
c CHAR(6),
d INTEGER,
e INTEGER,
f INTEGER
)
"""
COPY_FROM = """
COPY licenses (a, b, c, d, e, f) FROM STDIN
WITH (FORMAT CSV, DELIMITER ';', HEADER)
"""
COPY_TO = "COPY licenses TO STDOUT WITH (FORMAT CSV, HEADER)"
SELECT_FROM = 'SELECT * FROM licenses'
VACUUM = "VACUUM FULL ANALYZE"
DB_UNIX_SOCKET_URL = 'postgresql://test:test@/test'
DB_TCP_URL = 'postgresql://test:test@db/test'
def my_cursor_factory(*args, **kwargs):
    cursor = psycopg2.extensions.cursor(*args, **kwargs)
    cursor.itersize = 10240
    return cursor

class TestImportDataSpeed(unittest.TestCase):

    @pytest.fixture(autouse=True)
    def setupBenchmark(self, benchmark):
        self.benchmark = benchmark

    @classmethod
    def setUpClass(cls):
        cls.engine = create_engine(DB_TCP_URL, connect_args={'cursor_factory': my_cursor_factory})
        connection = cls.engine.connect().connection
        cursor = connection.cursor()
        cursor.execute(DROP_TABLE)
        cursor.execute(CREATE_TABLE)

        with gzip.open(DATA_FILE, 'rb') as f:
            cursor.copy_expert(COPY_FROM, file=f, size=1048576)

        connection.commit()

        connection.set_session(autocommit=True)
        cursor.execute(VACUUM)

        cursor.close()
        connection.close()

    def test_pd_csv(self):

        def result():
            return pd.read_csv(DATA_FILE, delimiter=';', low_memory=False)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_psycopg2_cursor(self):

        def result():
            connection = self.engine.connect().connection
            cursor = connection.cursor()
            cursor.itersize = 102400
            cursor.arraysize = 102400
            cursor.execute(SELECT_FROM)
            rows = cursor.fetchall()
            cursor.close()
            connection.close()
            return pd.DataFrame(rows)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_naive(self):

        def result():
            return pd.read_sql_query(SELECT_FROM, self.engine)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pd_sqla_chunked(self):

        def result():
            gen = (x for x in pd.read_sql(SELECT_FROM, self.engine, chunksize=10240))
            return pd.concat(gen, ignore_index=True)

        df = self.benchmark(result)
        assert df.shape == (3742616, 6)

    def test_pg_copy(self):
        connection = self.engine.connect().connection
        cursor = connection.cursor()

        def result(cursor):
            f = io.StringIO()
            cursor.copy_expert(COPY_TO, file=f, size=1048576)
            f.seek(0)
            return pd.read_csv(f, low_memory=False)

        df = self.benchmark(result, cursor)
        assert df.shape == (3742616, 6)
The final results:
speed_test.py .....
-------------------------------------------------------------------------------- benchmark: 5 tests -------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pd_csv 1.4623 (1.0) 1.4903 (1.0) 1.4776 (1.0) 0.0110 (1.21) 1.4786 (1.0) 0.0171 (1.15) 2;0 0.6768 (1.0) 5 1
test_pg_copy 3.0631 (2.09) 3.0842 (2.07) 3.0732 (2.08) 0.0091 (1.0) 3.0769 (2.08) 0.0149 (1.0) 2;0 0.3254 (0.48) 5 1
test_psycopg2_cursor 4.5325 (3.10) 4.5724 (3.07) 4.5531 (3.08) 0.0161 (1.77) 4.5481 (3.08) 0.0249 (1.68) 2;0 0.2196 (0.32) 5 1
test_pd_sqla_naive 6.0177 (4.12) 6.0523 (4.06) 6.0369 (4.09) 0.0147 (1.62) 6.0332 (4.08) 0.0242 (1.63) 2;0 0.1656 (0.24) 5 1
test_pd_sqla_chunked 6.0247 (4.12) 6.1454 (4.12) 6.0889 (4.12) 0.0442 (4.86) 6.0963 (4.12) 0.0524 (3.52) 2;0 0.1642 (0.24) 5 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Upvotes: 8
Reputation: 501
When working with a PostgreSQL database, you can use a combination of SQL and CSV to get the best from both methods. SQL to select exactly the data you need and CSV output to quickly load it into a pandas DataFrame.
import io

import pandas as pd
import psycopg2

# conn_params is assumed to hold your connection settings (host, dbname, user, password, ...)
conn = psycopg2.connect(**conn_params)
with conn.cursor() as cur:
    sql = 'SELECT * FROM large_table'
    buf = io.StringIO()
    cur.copy_expert(f'COPY ({sql}) TO STDOUT WITH CSV HEADER', buf)
    buf.seek(0)
    df = pd.read_csv(buf, header=0, low_memory=False,
                     true_values='t', false_values='f')
conn.close()
This uses PostgreSQL's fast COPY command in combination with psycopg2's copy_expert() function to read query results into a string buffer in CSV format. You can then use pandas read_csv() on that string buffer.
The drawback is that you may have to convert data types afterwards (e.g. timestamps would be strings). The read_csv() function has a few parameters that can help deal with that (e.g. parse_dates, true_values, false_values, ...).
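For illustration, here is a minimal sketch of that conversion step, assuming a hypothetical timestamp column named created_at (buf is the string buffer filled by copy_expert() above):
import pandas as pd

# buf is the io.StringIO object filled by copy_expert() in the snippet above.
df = pd.read_csv(
    buf,
    header=0,
    low_memory=False,
    parse_dates=['created_at'],              # hypothetical timestamp column -> datetime64
    true_values=['t'], false_values=['f'],   # PostgreSQL emits booleans as 't'/'f' in CSV
)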
In my use case (30 million rows, 15 columns) this gave me a performance improvement of about 2-3x compared to the pandas read_sql() function.
Upvotes: 7
Reputation: 1314
It's perfectly normal that CSV is much faster than SQL, but the two are not meant for the same thing, even though you can use them for the same thing:
CSV is for sequential access, i.e. you start at the beginning of the file and read each row one after the other, processing it as needed.
SQL is for indexed access, i.e. you look at an index and then go straight to the row you're looking for. You can also perform a full table scan, i.e. not make use of any index, which makes the table essentially a bloated CSV.
Your query is a full table scan: it doesn't look at the index because it asks for ALL the data, so yes, this is normal.
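If you want to verify that, you can ask PostgreSQL for the query plan; a quick sketch, reusing the engine and the mytable placeholder from the question:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/schema')

# For "select * from mytable" the plan reports a sequential scan
# ("Seq Scan on mytable"), i.e. no index is used.
print(pd.read_sql_query('EXPLAIN SELECT * FROM mytable', engine))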
On the other hand, if you try a query like
select * from mytable where myindex = 'myvalue';
you'll get a HUGE boost compared to searching for the same rows in the CSV. That's because of the indexes in SQL.
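In pandas terms, pushing that filter into the query looks roughly like this sketch (mytable, myindex and myvalue are the placeholders from above; in real code you would bind the value as a parameter rather than interpolating it into the string):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/schema')

# Only the matching rows are transferred; PostgreSQL can use the index on
# "myindex" instead of scanning and shipping the whole table.
df = pd.read_sql_query("SELECT * FROM mytable WHERE myindex = 'myvalue'", engine)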
Upvotes: 1
Reputation: 17122
This is normal behavior; reading a CSV file is always one of the quickest ways to simply load data.
A CSV is very naive and simple, so loading directly from it is very quick. For a massive database with a complex structure, CSV is not an option. SQL is super fast at selecting data from a table and returning that data to you. Naturally, being able to select, modify and manipulate data adds an overhead time cost to your call.
Imagine that you have a time series in a CSV from 1920 to 2017, but you only want data from 2010 to today.
The CSV approach would be to load the entire file and then select the years 2010 to 2017.
The SQL approach would be to pre-select the years via an SQL SELECT.
In that scenario, SQL would be MUCH faster.
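A rough sketch of that scenario (the file, table and column names here are made up for illustration):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/schema')

# CSV approach: read the whole 1920-2017 file, then throw most of it away.
df_csv = pd.read_csv('timeseries.csv', parse_dates=['date'])
df_csv = df_csv[df_csv['date'] >= '2010-01-01']

# SQL approach: let the database filter, so only rows from 2010 onwards are transferred.
df_sql = pd.read_sql_query(
    "SELECT * FROM timeseries WHERE date >= '2010-01-01'",
    engine,
    parse_dates=['date'],
)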
Upvotes: 14