user6730906
user6730906

Reputation:

Python pandas large database using excel

I am comfortable using python / excel / pandas for my dataFrames . I do not know sql or database languages .

I am about to start on a new project that will include around 4,000 different excel files I have. I will call to have the file opened saved as a dataframe for all 4000 files and then do my math on them. This will include many computations such a as sum , linear regression , and other normal stats.

My question is I know how to do this with 5-10 files no problem. Am I going to run into a problem with memory or the programming taking hours to run? The files are around 300-600kB . I don't use any functions in excel only holding data. Would I be better off have 4,000 separate files or 4,000 tabs. Or is this something a computer can handle without a problem? Thanks for looking into have not worked with a lot of data before and would like to know if I am really screwing up before I begin.

Upvotes: 1

Views: 211

Answers (1)

tvashtar
tvashtar

Reputation: 4315

You definitely want to use a database. At nearly 2GB of raw data, you won't be able to do too much to it without choking your computer, even reading it in would take a while.

If you feel comfortable with python and pandas, I guarantee you can learn SQL very quickly. The basic syntax can be learned in an hour and you won't regret learning it for future jobs, its a very useful skill.

I'd recommend you install PostgreSQL locally and then use SQLAlchemy to connect to create a database connection (or engine) to it. Then you'll be happy to hear that Pandas actually has df.to_sql and pd.read_sql making it really easy to push and pull data to and from it as you need it. Also SQL can do any basic math you want like summing, counting etc.

Connecting and writing to a SQL database is as easy as:

from sqlalchemy import create_engine
my_db = create_engine('postgresql+psycopg2://username:password@localhost:5432/database_name')
df.to_sql('table_name', my_db, if_exists='append')

I add the last if_exists='append' because you'll want to add all 4000 to one table most likely.

Upvotes: 2

Related Questions