Paul Mathew

Reputation: 25

pd.read_csv optimization to reduce the running time

My input file is a 20 GB .txt file, so I run into performance issues when I test-run the code below. pd.read_csv alone takes more than 3 hours, and I need to optimize the reading stage.

Sample input file.

007064770000|SODIUM|95 MILLIGRAM|0
007064770000|MULTI|001|0
007064770000|PET STARCH FREE|NOT APPLICABLE|0
007064770000|GRAIN TYPE|FLOUR|0
003980010200|MULTI|001|0
003980010200|DEAL|NON-DEAL|0
003980010200|PRODUCT SIZE|1 COUNT|0
003980010200|BASE SIZE|1 COUNT|0
757582821517|HW APPLIANCES|001|0
757582821516|HW APPLIANCES|001|0
757582821517|PACKAGE GENERAL SHAPE|BOTTLE|0
757582821517|SYND FORM|CREAM|0
757582821517|FORM|CREAM|0
757582821517|TARGET SKIN CONDITION|DRY SKIN|0
003980010205|HW MEDICINE|NON-DEAL|0
003980010205|PRODUCT SIZE|1 COUNT|0
003980010205|BASE SIZE|1 COUNT|0
007064770054|SODIUM|95 MILLIGRAM|0
007064770054|HW SPORTS|001|0
007064770054|PET STARCH FREE|NOT APPLICABLE|0
007064770054|GRAIN TYPE|FLOUR|0
003980010312|HW DIAMETER|1 COUNT|0
003980010312|BASE SIZE|1 COUNT|0 

Output file

       UPC code HW APPLIANCES HW DIAMETER HW MEDICINE HW SPORTS
0    3980010205           NaN         NaN    NON-DEAL       NaN
1    3980010312           NaN     1 COUNT         NaN       NaN
2    7064770054           NaN         NaN         NaN       001
3  757582821516           001         NaN         NaN       NaN
4  757582821517           001         NaN         NaN       NaN

Existing code

import pandas as pd

df = pd.read_csv('sample.txt', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                 engine='python')
df = df[df['chr_typ'].str.contains('HW ')]  # keep only the HW* characteristics
df = df.sort_values('chr_typ')              # sort_values returns a copy, so assign it back
df = (
    df.iloc[:, :-1]  # remove last column (chr_vl_typ)
        .pivot(index=['upc_cd'], columns=['chr_typ'])
        .droplevel(0, axis=1)  # fix levels and axis names
        .rename_axis('UPC code')
        .rename_axis(None, axis=1)
        .reset_index()
)
print(df)
df.to_csv('output.csv', sep=',', index=False, mode='w', encoding='utf-8')

Please suggest modifications to the code to reduce the running time.

Upvotes: 0

Views: 714

Answers (3)

Durtal

Reputation: 1028

According to the docs, the Python engine is slower than the default C engine. Do you really need it? It is also worth specifying the dtype of every column beforehand, so pandas doesn't have to infer the dtypes itself. The docs list some additional parameters you can experiment with, such as memory_map.
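
For example (a minimal sketch; the dtype choices are assumptions based on the sample data, so adjust them to your actual columns):

import pandas as pd

df = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    dtype={'upc_cd': str, 'chr_typ': str, 'chr_vl': str, 'chr_vl_typ': 'int8'},
    engine='c',        # the default C parser; just drop engine='python'
    memory_map=True,   # optional: map the file into memory before parsing
)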

Furthermore: it might be old-fashioned, but I don't think you should put 20 GB into a DataFrame, nor into a single CSV file. I guess you are better off with an SQLite database, from which you can access data quickly without loading it all into memory.
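
A rough sketch of that approach, assuming the column layout from the question (the database name, table name and chunk size are arbitrary choices):

import sqlite3
import pandas as pd

con = sqlite3.connect('characteristics.db')
cols = ['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ']

# stream the 20 GB file into SQLite in manageable pieces
with pd.read_csv('sample.txt', sep='|', names=cols, dtype=str,
                 chunksize=1_000_000) as reader:
    for chunk in reader:
        chunk.to_sql('characteristics', con, if_exists='append', index=False)

# later, pull only the rows you need instead of the whole file
hw = pd.read_sql_query(
    "SELECT upc_cd, chr_typ, chr_vl FROM characteristics "
    "WHERE chr_typ LIKE 'HW %'",
    con,
)
con.close()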

Upvotes: 1

OleGregersen

Reputation: 19

With pandas you need to split up the dataset, as the other answer suggests. However, you can also use Dask, as it handles large datasets better: https://dask.org/
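
A minimal Dask sketch, assuming the column names from the question; Dask reads the file lazily in partitions instead of loading all 20 GB at once:

import dask.dataframe as dd

ddf = dd.read_csv('sample.txt', sep='|',
                  names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                  dtype=str)
hw = ddf[ddf['chr_typ'].str.contains('HW ')]
result = hw.compute()  # materialize only the (much smaller) filtered frame in pandas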

Upvotes: 0

René

Reputation: 4827

By default, the pandas read_csv() function loads the entire dataset into memory, which can become a memory and performance problem when importing a huge CSV file.

read_csv() has an argument called chunksize that lets you retrieve the data in same-sized chunks. This is especially useful when reading a huge dataset as part of your data science project.

Example:

import pandas as pd
import numpy as np

np.random.seed(123)

# create a dummy 1,000,000-row CSV just for the demonstration
n_row, n_col, chunksize = 1_000_000, 5, 100_000
pd.DataFrame(np.random.randn(n_row, n_col)).to_csv('sample_data.csv')

# process the file 100,000 rows at a time instead of loading it all at once
sum_list = list()
with pd.read_csv('sample_data.csv', chunksize=chunksize) as reader:
    for chunk in reader:
        sum_list.append(chunk.iloc[:, 0].sum())

print(sum_list)

print(sum(sum_list))
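
Applied to the file from the question, a sketch could look like this (the chunk size is an arbitrary choice; filtering each chunk keeps only the comparatively small set of HW rows in memory before pivoting):

import pandas as pd

cols = ['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ']
parts = []
with pd.read_csv('sample.txt', sep='|', names=cols, dtype=str,
                 chunksize=1_000_000) as reader:
    for chunk in reader:
        # keep only the HW rows of each chunk, discard the rest immediately
        parts.append(chunk[chunk['chr_typ'].str.contains('HW ')])

df = pd.concat(parts, ignore_index=True)
df = (
    df.pivot(index='upc_cd', columns='chr_typ', values='chr_vl')
      .rename_axis('UPC code')
      .rename_axis(None, axis=1)
      .reset_index()
)
df.to_csv('output.csv', index=False)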

For more info see question 25962114 or the pandas documentation.

Upvotes: 0
