Reputation: 1142
I have a huge MySQL table with millions of records. I have a use case where, based on a date column, I want to split the results and dump them into different text files.
For example:
col_1  col_2  date_col
a      b      20-12-2021
a      b      20-12-2021
a2     b2     21-12-2021
..     ..     ...
So rows 1 and 2 will go to a CSV called 20_12_2021.csv, and row 3 will go to 21_12_2021.csv. What I am doing right now is using Python to fetch the results and then looping over them row by row to decide which file each row should go to, but it is taking a huge amount of time.
Is there a better, more optimized way to do this?
Upvotes: 2
Views: 474
Reputation: 8816
You can use df.groupby with the date_col column, which groups the DataFrame along axis 0 (the default, i.e. by rows):
for name, g in df.groupby('date_col'):
    g.to_csv(f'{name}.csv', index=False)
Result:
$ cat 20-12-2021.csv <-- CSV file created
col_1,col_2,date_col
a,b,20-12-2021
a,b,20-12-2021
$ cat 21-12-2021.csv <-- CSV file created
col_1,col_2,date_col
a2,b2,21-12-2021
OR
df.groupby('date_col').apply(lambda x: x.to_csv('{}.csv'.format(x.name), index=False))
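Both snippets assume df is already in memory. As a rough sketch (the connection string and table name below are placeholders, not from the original post), it could be loaded from MySQL with pandas.read_sql:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name; replace with your own.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df = pd.read_sql("SELECT col_1, col_2, date_col FROM my_table", engine)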
Note: if you are curious about reading millions of rows from MySQL efficiently, there is a nice answer here: Pandas now has built-in support for chunked loading by @ThePhysicist.
Also look at Reading table with chunksize still pumps the memory.
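If the full table does not fit comfortably in memory, one possible variation (again with placeholder connection details and table name) is to read in chunks and append each chunk's groups to per-date files. Note that, as the linked question discusses, some MySQL drivers may still buffer the whole result client-side unless a streaming/server-side cursor is used:

import os
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; replace with your own.
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

query = "SELECT col_1, col_2, date_col FROM my_table"
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    for name, g in chunk.groupby('date_col'):
        path = f'{name}.csv'
        # Write the header only when the file is first created,
        # then append rows from later chunks with the same date.
        g.to_csv(path, mode='a', index=False, header=not os.path.exists(path))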
Upvotes: 1