Ranger

Reputation: 105

Scalable way to get data ready / into pandas or similar tools

I have around 600 GB of CSV data, around 1 billion lines, stored in around 80 million text files.

For further analysis, specifically network analysis, I would first have to aggregate some of the data and then build the analysis on top of that.

Normally I would use a database to parse and store the CSVs' contents and do the aggregation there, but I like the idea of working in-memory because of the heavy compute resources available at work.

For the aggregation part: how would one do that utilizing 40 cores and 120 GB of RAM? Pandas will probably not do the trick; what about Dask or Modin? Just reading the 600 GB of CSVs into a dataframe, aggregating, and then saving the result back to CSV seems like an idea, roughly along the lines of the sketch below.
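Something like this minimal Dask sketch is what I have in mind. The column names (source, target, weight) and the glob path are placeholders for my actual schema, and the worker/memory settings are just a guess for a 40-core, 120 GB machine:

```python
import dask.dataframe as dd
from dask.distributed import Client

# Local cluster: split the 40 cores across workers and cap memory per
# worker so partitions spill to disk instead of exhausting the 120 GB.
client = Client(n_workers=20, threads_per_worker=2, memory_limit="6GB")

# Lazily read all CSVs matching the glob; nothing is loaded yet.
# Column names and dtypes are placeholders for the real schema.
df = dd.read_csv(
    "data/**/*.csv",
    dtype={"source": "string", "target": "string", "weight": "float64"},
)

# Example aggregation: total edge weight per (source, target) pair,
# which is the kind of edge list I would feed into the network analysis.
edges = (
    df.groupby(["source", "target"])["weight"]
      .sum()
      .reset_index()
)

# Trigger the computation and write the aggregate out; Parquet is usually
# a better sink than CSV, but edges.to_csv(...) would work the same way.
edges.to_parquet("aggregated_edges.parquet", write_index=False)
```

One concern with this approach: with 80 million tiny files, the per-file overhead of read_csv alone might dominate, so maybe the files need to be concatenated into larger chunks first?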

Upvotes: 0

Views: 75

Answers (0)
