Reputation: 8796
I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?
It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:
The data might have errors (some rows may be not be parseable), the data might be duplicated, the data might be really large.
Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas will I run into problems if rows are unparseable, if the data is really big, or if data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-dupping).
I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.
Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
Upvotes: 3
Views: 3542
Reputation: 1154
I had a school project just last week that required me to load data from a csv and insert it into a postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, so if your database is all a string datatype, forget what I said, you're golden. But if you have a csv with an assortment of datatypes, either you get to sniff them yourself or you can use pandas which does it efficiently and automatically. Plus pandas has a nifty write to sql method which can be easily adapted to work with postgres via a sql alchemy connection, too.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
Upvotes: 1
Reputation: 35998
A quick googling didn't reveal any serious drawbacks in pandas.read_csv
regarding its functionality (parsing correctness, supported types etc.). Moreover, since you appear to be using pandas
to load the data into the DB, too, reading directly into a DataFrame
is a huge boost in both performance and memory (no redundant copies).
There are only memory issues for very large datasets - but these are not library's fault. How to read a 6 GB csv file with pandas has instructions on how to process a large .csv
in chunks with pandas
.
Regarding "The data might have errors", read_csv
has a few facilities like converters
, error_bad_lines
and skip_blank_lines
(specific course of action depends on if and how much corruption you're supposed to be able to recover).
Upvotes: 2