TheDaJon
TheDaJon

Reputation: 565

Pandas - split large excel file

I have an excel file with about 500,000 rows and I want to split it to several excel file, each with 50,000 rows.

I want to do it with pandas so it will be the quickest and easiest.

any ideas how to make it?

thank you for your help

Upvotes: 7

Views: 17302

Answers (5)

Behrooz Ostadaghaee
Behrooz Ostadaghaee

Reputation: 91

I wrote a function for this:

import numpy as np
import pandas as pd

def split_excel(file_name, n):  # n: number of chunks or parts (number of outputed excel files)
    df = pd.read_excel(file_name)
    l = len(df)
    c = l // n # c: number of rows
    r = l % c

    if r != 0:  # if it is not divisible
        df[-r:].to_excel(f'part_{l//c+1}.xlsx', index=False)
        df = df[:-r]

    i = 0
    for part in np.split(df, l//c):
        part.to_excel(f'part_{i}.xlsx', index=False)
        i += 1
        
        
split_excel('my_file.xlsx')

Upvotes: 0

user16005292
user16005292

Reputation: 51

import pandas as pd
l = pd.read_excel("inputfilename.xlsx")
total_size = 500,000
chunk_size = 50000
for i in range(0, total_size, chunk_size):
    df = l[i:i+chunk_size]
    df.to_excel(str(i)+"outputfilename.xlsx")

Upvotes: 2

wild
wild

Reputation: 311

use np.split_array as per this answer https://stackoverflow.com/a/17315875/1394890 if you get

array split does not result in an equal division

Upvotes: 3

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

Assuming that your Excel file has only one (first) sheet containing data, I'd make use of chunksize parameter:

import pandas as pd
import numpy as np

i=0
for df in pd.read_excel(file_name, chunksize=50000):
    df.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
    i += 1

UPDATE:

chunksize = 50000
df = pd.read_excel(file_name)
for chunk in np.split(df, len(df) // chunksize):
    chunk.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)

Upvotes: 9

Tarun Balani
Tarun Balani

Reputation: 11

As explained by MaxU, I will also make use of a variable chunksize and divide the total number of rows in large file into required number of rows.

import pandas as pd
import numpy as np

chunksize = 50000
i=0
df = pd.read_excel("path/to/file.xlsx")
for chunk in np.split(df, len(df) // chunksize):
    chunk.to_excel('path/to/destination/folder/file_{:02d}.xlsx'.format(i), index=True)
    i += 1

Hope this would help you.

Upvotes: 1

Related Questions