Reputation: 25
My input file is a 20GB .txt file, so I run into performance issues when I test the code below: pd.read_csv alone takes more than 3 hours. I need to optimize the reading stage.
Sample input file.
007064770000|SODIUM|95 MILLIGRAM|0
007064770000|MULTI|001|0
007064770000|PET STARCH FREE|NOT APPLICABLE|0
007064770000|GRAIN TYPE|FLOUR|0
003980010200|MULTI|001|0
003980010200|DEAL|NON-DEAL|0
003980010200|PRODUCT SIZE|1 COUNT|0
003980010200|BASE SIZE|1 COUNT|0
757582821517|HW APPLIANCES|001|0
757582821516|HW APPLIANCES|001|0
757582821517|PACKAGE GENERAL SHAPE|BOTTLE|0
757582821517|SYND FORM|CREAM|0
757582821517|FORM|CREAM|0
757582821517|TARGET SKIN CONDITION|DRY SKIN|0
003980010205|HW MEDICINE|NON-DEAL|0
003980010205|PRODUCT SIZE|1 COUNT|0
003980010205|BASE SIZE|1 COUNT|0
007064770054|SODIUM|95 MILLIGRAM|0
007064770054|HW SPORTS|001|0
007064770054|PET STARCH FREE|NOT APPLICABLE|0
007064770054|GRAIN TYPE|FLOUR|0
003980010312|HW DIAMETER|1 COUNT|0
003980010312|BASE SIZE|1 COUNT|0
Output file
UPC code HW APPLIANCES HW DIAMETER HW MEDICINE HW SPORTS
0 3980010205 NaN NaN NON-DEAL NaN
1 3980010312 NaN 1 COUNT NaN NaN
2 7064770054 NaN NaN NaN 001
3 757582821516 001 NaN NaN NaN
4 757582821517 001 NaN NaN NaN
Existing code
import pandas as pd

df = pd.read_csv('sample.txt', sep='|', names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'], engine='python')

# Keep only the 'HW ' attribute rows.
df = df[df['chr_typ'].str.contains('HW ')]
df = df.sort_values('chr_typ')

df = (
    df.iloc[:, :-1]                              # Remove last column
      .pivot(index='upc_cd', columns='chr_typ')  # One column per 'HW ' attribute
      .droplevel(0, axis=1)                      # Fix levels and axis names
      .rename_axis('UPC code')
      .rename_axis(None, axis=1)
      .reset_index()
)
print(df)
df.to_csv('output.csv', sep=',', index=False, mode='w', encoding='utf-8')
Please suggest modifications to the code to reduce the running time.
Upvotes: 0
Views: 714
Reputation: 1028
According to the docs, the Python engine is slower than the default C engine. Do you really need it? It is also worth specifying the dtype of every column beforehand, so pandas doesn't have to infer the dtypes itself. The docs list some additional parameters you can experiment with, like memory_map.
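For example, a minimal sketch assuming the column layout from your question (the specific dtypes and memory_map are assumptions to experiment with, not measured recommendations):

import pandas as pd

# Default C engine (no engine='python'), explicit dtypes so pandas
# skips type inference, and memory_map to avoid some I/O overhead.
df = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    dtype={'upc_cd': str,        # keep as string to preserve leading zeros
           'chr_typ': str,
           'chr_vl': str,
           'chr_vl_typ': 'int8'},  # assumption: this column only holds small integers
    memory_map=True,
)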
Furthermore: it might be old fashioned, but I don't think you should put 20GB into a DataFrame, nor into a single CSV file. I suspect you are better off with an SQLite database, from which you can access the data quickly without loading it all into memory.
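A rough sketch of that idea, assuming a chunked load into a local SQLite file (the file name sample.db, the table name attributes, and the chunk size are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect('sample.db')

# Stream the large file into SQLite in chunks, so it is never fully in memory.
with pd.read_csv('sample.txt', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                 dtype=str, chunksize=1_000_000) as reader:
    for chunk in reader:
        chunk.to_sql('attributes', conn, if_exists='append', index=False)

# Later, pull only the rows you actually need.
hw = pd.read_sql_query(
    "SELECT upc_cd, chr_typ, chr_vl FROM attributes WHERE chr_typ LIKE 'HW %'",
    conn)
conn.close()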
Upvotes: 1
Reputation: 19
With pandas you need to split up the dataset, as the other answer suggests. However, you can also use Dask, which is better suited to handling large datasets: https://dask.org/
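For example, a minimal sketch assuming the column layout from the question (the blocksize and the 'HW ' filter step are assumptions):

import dask.dataframe as dd

# Dask reads the file lazily in partitions and processes them in parallel.
ddf = dd.read_csv('sample.txt', sep='|',
                  names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                  dtype=str, blocksize='64MB')

# Filter down to the 'HW ' rows first, then materialize the much smaller
# result as a regular pandas DataFrame and pivot it as in the original code.
hw = ddf[ddf['chr_typ'].str.contains('HW ')].compute()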
Upvotes: 0
Reputation: 4827
By default, pandas read_csv() loads the entire dataset into memory, which can cause memory and performance problems when importing a huge CSV file. read_csv() has an argument called chunksize that lets you retrieve the data in same-sized chunks. This is especially useful when reading a huge dataset as part of your data science project.
Example:
import pandas as pd
import numpy as np

np.random.seed(123)
n_row, n_col, chunksize = 1_000_000, 5, 100_000

# Write a sample CSV to demonstrate chunked reading.
pd.DataFrame(np.random.randn(n_row, n_col)).to_csv('sample_data.csv', index=False)

# Read the file back in 100,000-row chunks and sum the first column of each chunk.
sum_list = list()
with pd.read_csv('sample_data.csv', chunksize=chunksize) as reader:
    for chunk in reader:
        sum_list.append(chunk.iloc[:, 0].sum())

print(sum_list)
print(sum(sum_list))
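Applied to your file, a rough sketch could filter each chunk down to the 'HW ' rows before concatenating, so only the much smaller result gets pivoted (the chunk size of 1,000,000 rows is an assumption; tune it to your available memory):

import pandas as pd

parts = []
with pd.read_csv('sample.txt', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                 dtype=str, chunksize=1_000_000) as reader:
    for chunk in reader:
        # Keep only the 'HW ' attribute rows from each chunk.
        parts.append(chunk[chunk['chr_typ'].str.contains('HW ')])

df = pd.concat(parts, ignore_index=True)
df = (df.iloc[:, :-1]
        .pivot(index='upc_cd', columns='chr_typ')
        .droplevel(0, axis=1)
        .rename_axis('UPC code')
        .rename_axis(None, axis=1)
        .reset_index())
df.to_csv('output.csv', index=False)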
For more info see question 25962114 or the pandas documentation.
Upvotes: 0