user7779326

Reputation:

How to input large data into python pandas using looping or parallel computing?

I have an 8 GB csv file and I am not able to run the code below, as it raises a memory error.

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)

I would like to split the file into 8 smaller files (sorted by id) using python, and finally have a loop so that the output file combines the output of all 8 files.

Alternatively, I would like to try parallel computing. The main goal is to process the 8 GB of data in python pandas. Thank you.

My csv file contains many rows, with '/' as the column separator:

id    venue           time             code    value ......
AAA   Paris      28/05/2016 09:10      PAR      45   ......
111   Budapest   14/08/2016 19:00      BUD      62   ......
AAA   Tokyo      05/11/2016 23:20      TYO      56   ......
111   LA         12/12/2016 05:55      LAX      05   ......
111   New York   08/01/2016 04:25      NYC      14   ......
AAA   Sydney     04/05/2016 21:40      SYD      2    ......
ABX   HongKong   28/03/2016 17:10      HKG      5    ......
ABX   London     25/07/2016 13:02      LON      22   ......
AAA   Dubai      01/04/2016 18:45      DXB      19   ......
...

Upvotes: 12

Views: 1746

Answers (5)

gbajson

Reputation: 1730

If you don't need all columns, you may also use the usecols parameter:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

usecols : array-like or callable, default None

Return a subset of the columns. [...] 
Using this parameter results in much faster parsing time and lower memory usage.
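
For example, a minimal sketch assuming only the id, code and value columns from the sample data above are actually needed:

import pandas as pd

# Parse only the needed columns; the rest of each line is never converted,
# which cuts both parsing time and memory usage.
df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str,
                 usecols=["id", "code", "value"])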

Upvotes: 0

SayPy

Reputation: 566

import numpy as np
from multiprocessing import Pool

def processor(df):
    # Some work on each piece, e.g. sort it by id
    df.sort_values('id', inplace=True)
    return df

# df is assumed to be the already-loaded DataFrame
size = 8
df_split = np.array_split(df, size)

# process the 8 pieces in parallel and write each result to its own file
cores = 8
pool = Pool(cores)
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
    frame.to_csv('{}'.format(n))
pool.close()
pool.join()

Upvotes: 9

VinceP

Reputation: 2163

Use the chunksize parameter to read one chunk at a time and save the chunks to disk. This will split the original file into equal parts of 100000 rows each:

file = "./data.csv"
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize = 100000)

for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/") 

If you know the number of rows in the original file, you can calculate the exact chunksize to split it into 8 equal parts (nrows/8).
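
A minimal sketch of that calculation, counting the lines first since the row count is not given in the question:

import pandas as pd

file = "./data.csv"

# count the data rows once (minus 1 for the header line)
with open(file) as f:
    nrows = sum(1 for _ in f) - 1

# ceiling division so the file splits into 8 roughly equal parts
chunksize = (nrows + 7) // 8

chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)
for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/", index=False)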

Upvotes: 6

Quickbeam2k1

Reputation: 5437

You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the csv file is transformed into multiple pandas dataframes, each read in only when necessary. However, not every pandas command is available within dask.
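
A minimal sketch, assuming dask is installed; the groupby at the end is just an illustrative operation:

import dask.dataframe as dd

# dask reads the csv lazily as a collection of pandas partitions
# instead of one 8 GB DataFrame in memory
ddf = dd.read_csv("./data.csv", sep="/", dtype=str)

# operations look like pandas, but only run when .compute() is called
rows_per_id = ddf.groupby("id").size().compute()
print(rows_per_id)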

Upvotes: 0

nitin

Reputation: 7358

pandas read_csv has two parameters that you could use to do what you want:

nrows : to specify the number of rows you want to read
skiprows : to specify the number of rows to skip at the start of the file, i.e. where you want to start reading

Refer to documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
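
A minimal sketch of reading one slice that way; the slice size and column names are assumptions based on the sample data above, and the names must be re-supplied because skiprows skips the header line as well:

import pandas as pd

cols = ["id", "venue", "time", "code", "value"]  # assumed column names
slice_size = 1000000   # illustrative number of rows per slice
part = 2               # read the third slice, for example

df = pd.read_csv("./data.csv", sep="/", dtype=str, names=cols,
                 skiprows=1 + part * slice_size,  # header + earlier slices
                 nrows=slice_size)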

Upvotes: 5
