Saif
Saif

Reputation: 95

reading from big csv file and save rows which meet condition into another df

I have a big csv file (dataset) with size 443,00 KB. The photo shows a sample of the file. I want save rows into another dataframe. I tried this way, but it is taking so much time e

import pandas as pd
df = pd.DataFrame()
for chunk in pd.read_csv("UsersVle.csv", chunksize=10):
    for i, row in chunk.iterrows():
        if((row['module']=='D3') & (row['presentation']=='13B')):
            df.append(row)

Searching for a solution, I found something about chuncksize and tried it this way , but there was error TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]

import itertools as IT

chunksize = 10 ** 3
chunks = pd.read_csv('UsersVle.csv', chunksize=chunksize)
chunks = IT.takewhile(lambda chunk: (chunk['module']=='D3' & chunk['presentation']=='13B'), chunks)
df = pd.concat(chunks) 

I need an efficient way to read from this big file and save the rows meeting the condition into another dataframe. I will appreciate your help. PS, I tried dask, but it seems did not read the file as I used df.head(), the were no returned rows !. enter image description here

Upvotes: 0

Views: 408

Answers (1)

S2L
S2L

Reputation: 1934

Not very scientific, but something to get an idea to adjust the value:

Using moviedataset/rating.csv:

print('Without chunksize')
start = timer()
df = pd.read_csv('ml-latest/ratings.csv')
df2 = df[df["rating"] == 5.0]
print(timer() - start)

for cs in range(4, 10):
    print('Chunk size', cs, 10 ** cs)
    start = timer()
    rdr= pd.read_csv('ml-latest/ratings.csv', chunksize=10**cs)
    df2 = pd.concat([chunk[chunk['rating'] == 5.0] for chunk in rdr])
    print(timer() - start)

Output:

Without chunksize
5.055990324995946
Chunk size 4 10000
8.80516574899957
Chunk size 5 100000
5.21452364900324
Chunk size 6 1000000
4.814042658996186
Chunk size 7 10000000
4.8958623920043465
Chunk size 8 100000000
5.152557591005461
Chunk size 9 1000000000
5.076704847000656

Upvotes: 1

Related Questions