user8834780

Reputation: 1670

pandas dataframe header relation to sql table header

I have a DataFrame with 100+ columns that I am trying to write into a SQL table. This is a daily full-load process, and the code looks like this:

import datetime
import os

import pandas as pd
import sqlalchemy as sa

now = datetime.datetime.now()
filename = 'extract_' + str(now)[:10]
output = "./output"

sql_df = pd.read_csv(os.path.join(output, filename + '.csv'))
server = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
engine = sa.create_engine(server, connect_args={'sslmode': 'verify-ca'}, use_batch_mode=True)
conn = engine.raw_connection()

cur = conn.cursor()
cur.execute('truncate table "table"')
conn.commit()  # the truncate is not applied until the connection commits
%time sql_df.to_sql('table', engine, index=False, if_exists='append')  # %time is an IPython magic

I am getting the following error: ProgrammingError: (psycopg2.ProgrammingError) column "column" of relation "table" does not exist.

Can someone clarify why to_sql compares the DataFrame's headers to the table's columns (and refuses the insert unless they all match), rather than just inserting the values positionally and skipping the header row? If instead of writing the DataFrame to SQL I upload the CSV to S3 and then use COPY, there are no errors.

Is there anything I can do to make it ignore the headers and just insert the values?
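(For context, this is the name-matching behavior in question. A minimal reproduction, using a throwaway in-memory SQLite database purely for illustration, shows that to_sql generates an INSERT that names the DataFrame's columns, so an append fails when those names are missing from the table, and succeeds once they are renamed to match:)

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")            # throwaway in-memory DB
conn.execute("CREATE TABLE t (a INTEGER)")    # the table column is named "a"

df = pd.DataFrame({"b": [1, 2]})              # the DataFrame column is named "b"
try:
    # pandas emits INSERT INTO t (b) VALUES (?), which the DB rejects
    df.to_sql("t", conn, index=False, if_exists="append")
except Exception as exc:
    print("append failed:", exc)

# Renaming the DataFrame column to match the table makes the same call succeed
df.rename(columns={"b": "a"}).to_sql("t", conn, index=False, if_exists="append")
```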

Upvotes: 3

Views: 4892

Answers (1)

d_kennetz

Reputation: 5359

You can rename the input df's columns to match the names in the table like this (where db_cols is the list of your database column names); I think this should work for your Redshift setup:

db_cols = list(pd.read_sql('...', con=alch_engine))  # '...' is a query against your table; list() returns its columns

(sql_df
 .rename(columns=dict(zip(sql_df.columns, db_cols)))
 .to_sql(name="table",
         con=alch_engine,
         if_exists="append",
         index=False,
         index_label=None))
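Note that zip pairs the two column lists positionally, so this assumes the DataFrame's column order already matches the table's. A self-contained sketch of the whole recipe (in-memory SQLite and made-up names, just for illustration), fetching db_cols with a zero-row query:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")

# Headers differ from the table only in case, so a positional rename is safe
sql_df = pd.DataFrame({"ID": [1, 2], "Name": ["alice", "bob"]})

# Read zero rows just to get the table's column names, in table order
db_cols = list(pd.read_sql("SELECT * FROM target LIMIT 0", conn))

(sql_df
 .rename(columns=dict(zip(sql_df.columns, db_cols)))  # positional pairing
 .to_sql("target", conn, if_exists="append", index=False))
```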

Upvotes: 4
