MichM

Reputation: 896

pandas to_sql truncates my data

I was using df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') to export a data frame into MySQL. However, I discovered that columns with long string content (such as url) are truncated to 63 characters. I received the following warning from the IPython notebook when I exported:

/usr/local/lib/python2.7/site-packages/pandas/io/sql.py:248: Warning: Data truncated for column 'url' at row 3
  cur.executemany(insert_query, data)

There were other warnings in the same style for different rows.

Is there anything I can tweak to export the full data properly? I could set up the correct data schema in mysql and then export to that. But I am hoping a tweak can just make it work straight from python.
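For reference, the manual-schema route would look roughly like this (a sketch, assuming the same con_mysql connection and spelling out only the url column):

cur = con_mysql.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS testdata (
        url TEXT
        -- ... remaining columns with appropriate types
    )
""")
con_mysql.commit()

# append into the pre-created table instead of letting pandas
# generate its own (truncating) VARCHAR(63) schema
df.to_sql(con=con_mysql, name='testdata', if_exists='append', flavor='mysql')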

Upvotes: 3

Views: 7747

Answers (2)

ted.strauss

Reputation: 4339

Inspired by @joris's answer, I decided to hard-code the change into the pandas source and recompile.

cd /usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io
sudo pico sql.py
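(If your pandas install lives somewhere else, the file can be located from Python itself; a quick check:)

import pandas.io.sql
print(pandas.io.sql.__file__)   # path of the installed sql.py (may end in .pyc)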

changed line 871

'mysql': 'VARCHAR (63)',

to

'mysql': 'VARCHAR (255)',

then recompiled just that file

sudo python -m py_compile sql.py
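(The same recompile can also be done from inside a Python session, if you prefer; a sketch, run from that directory:)

import py_compile
py_compile.compile('sql.py')   # regenerates sql.pyc next to the edited file (Python 2.7)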

restarted my script, and the to_sql() function wrote the table. (I expected the recompile to have broken pandas, but it seems not to have.)

Here is my script to write a dataframe to MySQL, for reference:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlalchemy
from sqlalchemy import create_engine

df = pd.read_csv('10k.csv')
## ... dataframe munging
df = df.where(pd.notnull(df), None)  # workaround for NaN bug

engine = create_engine('mysql://user:password@localhost:3306/dbname')
con = engine.connect().connection  # raw DBAPI connection for the legacy to_sql API
df.to_sql("issues", con, 'mysql', if_exists='replace', index=True, index_label=None)  # 'mysql' here is the flavor

Upvotes: 5

joris

Reputation: 139172

If you are using pandas 0.13.1 or older, this limit of 63 characters is indeed hardcoded, because of this line in the code: https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L278

As a workaround, you could monkeypatch that get_sqltype function:

from pandas.io import sql
import numpy as np              # the np.* type checks below need this
from datetime import datetime   # the datetime checks below need this

def get_sqltype(pytype, flavor):
    sqltype = {'mysql': 'VARCHAR (63)',    # <-- change this value to something sufficiently larger
               'sqlite': 'TEXT'}

    if issubclass(pytype, np.floating):
        sqltype['mysql'] = 'FLOAT'
        sqltype['sqlite'] = 'REAL'
    if issubclass(pytype, np.integer):
        sqltype['mysql'] = 'BIGINT'
        sqltype['sqlite'] = 'INTEGER'
    if issubclass(pytype, np.datetime64) or pytype is datetime:
        sqltype['mysql'] = 'DATETIME'
        sqltype['sqlite'] = 'TIMESTAMP'
    if pytype is datetime.date:
        sqltype['mysql'] = 'DATE'
        sqltype['sqlite'] = 'TIMESTAMP'
    if issubclass(pytype, np.bool_):
        sqltype['sqlite'] = 'INTEGER'

    return sqltype[flavor]

sql.get_sqltype = get_sqltype

And then your original code should just work:

df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql')

Starting from pandas 0.14, the sql module uses sqlalchemy under the hood, and strings are converted to the sqlalchemy TEXT type, which is mapped to the MySQL TEXT type (and not VARCHAR). This also allows you to store strings longer than 63 characters:

import sqlalchemy

engine = sqlalchemy.create_engine('mysql://scott:tiger@localhost/foo')
df.to_sql('testdata', engine, if_exists='replace')

The issue only remains if you still use a DBAPI connection instead of a sqlalchemy engine, but that option is deprecated and it is recommended to provide a sqlalchemy engine to to_sql.
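(Side note: later pandas versions also accept a dtype argument to to_sql if you want explicit control over the column type rather than the default TEXT; a sketch, assuming a url column as in the question:)

from sqlalchemy.types import VARCHAR

# pin the column type explicitly instead of the default TEXT
# (assumes a 'url' column and a pandas version that supports the dtype argument)
df.to_sql('testdata', engine, if_exists='replace', dtype={'url': VARCHAR(1000)})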

Upvotes: 9
