billyc59

Reputation: 91

pandas: split dataframe into multiple csvs

I have a large file imported into a single pandas DataFrame. I want to split it into many CSV files, based on the number of rows in the DataFrame.

e.g. with 10 rows: file 1 gets rows [0:4], file 2 gets rows [5:9]

Is there a way to do this without having to create more dataframes?

Upvotes: 3

Views: 13876

Answers (4)

kerfuffle

Reputation: 55

Use numpy.array_split to split your DataFrame dfX and save it in N CSV files of (nearly) equal size, dfX_1.csv to dfX_N.csv. If the row count is not divisible by N, the first few files get one extra row.

import numpy as np

N = 10  # number of output files; dfX is your existing DataFrame
for i, df in enumerate(np.array_split(dfX, N)):
    df.to_csv(f"dfX_{i + 1}.csv", index=False)
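
To see how the pieces are sized when the row count does not divide evenly, here is a small sketch that splits row positions only, not a DataFrame. The helper name chunk_sizes is hypothetical and not part of NumPy.

```python
import numpy as np

def chunk_sizes(n_rows, n_chunks):
    """Return the length of each piece np.array_split produces for n_rows rows."""
    # Split the row positions 0..n_rows-1 and measure each piece.
    return [len(part) for part in np.array_split(np.arange(n_rows), n_chunks)]

print(chunk_sizes(10, 3))  # [4, 3, 3]
```

So with 10 rows and N = 3, the first file would hold 4 rows and the other two 3 rows each.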

Upvotes: 2

billyc59

Reputation: 91

Iterating over the start and stop positions passed to iloc will do the trick.
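
A minimal sketch of what that could look like, assuming a fixed number of rows per file. The function name split_to_csv, the rows_per_file parameter and the "out" file prefix are made up for illustration. Each slice is written out immediately and not kept around.

```python
import pandas as pd

def split_to_csv(df, rows_per_file, prefix="out"):
    """Write consecutive row slices of df to prefix0.csv, prefix1.csv, ..."""
    paths = []
    # Step through the start positions 0, rows_per_file, 2*rows_per_file, ...
    for file_no, start in enumerate(range(0, len(df), rows_per_file)):
        path = f"{prefix}{file_no}.csv"
        # iloc slices by position, so this works whatever the index looks like.
        df.iloc[start:start + rows_per_file].to_csv(path, index=False)
        paths.append(path)
    return paths
```

For the 10-row example in the question, split_to_csv(df, 5) would write rows 0 to 4 into out0.csv and rows 5 to 9 into out1.csv. The last file is simply shorter when the row count is not a multiple of rows_per_file.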

Upvotes: 0

Neil

Reputation: 14321

There are two ways of doing this. I believe you are looking for the former. In the first, we open a series of CSV writers, pick the correct writer for each row using some basic math on the index, then close all the files. The second splits the DataFrame into smaller DataFrames and writes each one out.

A single DataFrame evenly divided into N number of CSV files

import pandas as pd
import csv, math

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows in a single column
NUMBER_OF_SPLITS = 2
fileOpens = [open(f"out{i}.csv","w") for i in range(NUMBER_OF_SPLITS)]
fileWriters = [csv.writer(v, lineterminator='\n') for v in fileOpens]
# assumes the default 0..n-1 index, so i is also the row position
for i,row in df.iterrows():
    fileWriters[math.floor((i/df.shape[0])*NUMBER_OF_SPLITS)].writerow(row.tolist())
for file in fileOpens:
    file.close()
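
A variant sketch of the same idea, under two assumptions that are mine and not part of the code above: it uses the row position from enumerate, so it does not depend on the index being 0..n-1, and it uses contextlib.ExitStack so the files are closed even if a write fails. The function name write_rows_by_position is hypothetical.

```python
import csv
from contextlib import ExitStack

import pandas as pd

def write_rows_by_position(df, n_files, prefix="out"):
    """Stream the rows of df into n_files CSV files of near-equal size."""
    paths = [f"{prefix}{k}.csv" for k in range(n_files)]
    n_rows = len(df)
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits.
        files = [stack.enter_context(open(p, "w", newline="")) for p in paths]
        writers = [csv.writer(f, lineterminator="\n") for f in files]
        for pos, row in enumerate(df.itertuples(index=False)):
            # Integer form of floor(pos / n_rows * n_files).
            writers[pos * n_files // n_rows].writerow(row)
    return paths
```

Like the code above, this writes the values only, without a header row.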

A DataFrame split into N smaller DataFrames, each written to its own CSV file

import pandas as pd
import numpy as np

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10]) # uncreative input values: 10 rows in a single column
NUMBER_OF_SPLITS = 2
for i, new_df in enumerate(np.array_split(df,NUMBER_OF_SPLITS)):
    with open(f"out{i}.csv","w") as fo:
        fo.write(new_df.to_csv())

Upvotes: 4

BENY

Reputation: 323366

Assign a new column g. You just need to specify how many items you want in each group; here I am using 3. This relies on the default 0..n-1 index.

df.assign(g=df.index//3)
Out[324]: 
    0  g
0   1  0
1   2  0
2   3  0
3   4  1
4   5  1
5   6  1
6   7  2
7   8  2
8   9  2
9  10  3

assign returns a new DataFrame, so store the result first (df = df.assign(g=df.index//3)); then you can call df[df.g==1] to get what you need
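
To go from the group number to one CSV per group, a sketch along these lines might work. It builds the group key from row positions with NumPy, so it does not need the extra g column or a default index. The function name write_groups and the "group" file prefix are hypothetical.

```python
import numpy as np
import pandas as pd

def write_groups(df, rows_per_file, prefix="group"):
    """Write each block of rows_per_file consecutive rows to its own CSV."""
    # Position-based group key: 0,0,0,1,1,1,... for rows_per_file = 3.
    keys = np.arange(len(df)) // rows_per_file
    paths = []
    for key, chunk in df.groupby(keys):
        path = f"{prefix}{key}.csv"
        chunk.to_csv(path, index=False)
        paths.append(path)
    return paths
```

With the 10-row frame above and rows_per_file=3, this would produce four files holding 3, 3, 3 and 1 rows.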

Upvotes: 4
