Reputation: 95
I have a big CSV file (dataset) with size 443,00 KB. The photo shows a sample of the file. I want to save certain rows into another dataframe. I tried this way, but it is taking so much time:
import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv("UsersVle.csv", chunksize=10):
    for i, row in chunk.iterrows():
        if((row['module']=='D3') & (row['presentation']=='13B')):
            df.append(row)
Searching for a solution, I found something about chunksize and tried it this way, but there was an error: TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]
import itertools as IT
chunksize = 10 ** 3
chunks = pd.read_csv('UsersVle.csv', chunksize=chunksize)
chunks = IT.takewhile(lambda chunk: (chunk['module']=='D3' & chunk['presentation']=='13B'), chunks)
df = pd.concat(chunks)
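I suspect the TypeError comes from operator precedence, since & binds more tightly than ==, so each comparison probably needs its own parentheses, e.g. (chunk['module']=='D3') & (chunk['presentation']=='13B'); I also suspect takewhile is not the right tool here, since it stops at the first chunk that does not match instead of filtering all chunks.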
I need an efficient way to read from this big file and save the rows meeting the condition into another dataframe. I would appreciate your help. PS: I tried dask, but it seems it did not read the file; when I used df.head(), there were no rows returned!
Upvotes: 0
Views: 408
Reputation: 1934
Not very scientific, but here is something to give an idea of how to adjust the chunk size:
Using moviedataset/rating.csv:
import pandas as pd
from timeit import default_timer as timer  # timer() below is timeit's default_timer

print('Without chunksize')
start = timer()
df = pd.read_csv('ml-latest/ratings.csv')
df2 = df[df["rating"] == 5.0]
print(timer() - start)

for cs in range(4, 10):
    print('Chunk size', cs, 10 ** cs)
    start = timer()
    rdr = pd.read_csv('ml-latest/ratings.csv', chunksize=10 ** cs)
    df2 = pd.concat([chunk[chunk['rating'] == 5.0] for chunk in rdr])
    print(timer() - start)
Output:
Without chunksize
5.055990324995946
Chunk size 4 10000
8.80516574899957
Chunk size 5 100000
5.21452364900324
Chunk size 6 1000000
4.814042658996186
Chunk size 7 10000000
4.8958623920043465
Chunk size 8 100000000
5.152557591005461
Chunk size 9 1000000000
5.076704847000656
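Applied to the file from the question, the same per-chunk filter would look roughly like this (a sketch assuming the column names and values from the question, with a chunk size near the sweet spot above):

import pandas as pd

chunksize = 10 ** 6  # roughly the fastest value in the timings above
reader = pd.read_csv('UsersVle.csv', chunksize=chunksize)
# Filter each chunk with a vectorised boolean mask (note the parentheses
# around each comparison), then concatenate only the matching rows.
df = pd.concat(
    [chunk[(chunk['module'] == 'D3') & (chunk['presentation'] == '13B')]
     for chunk in reader]
)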
Upvotes: 1