Reputation: 9363
I have an extremely large CSV file with more than 500 million rows.
But I only need a few thousand rows from it, based on a certain condition. At the moment I am using:
import csv

with open('/home/Documents/1681.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    # DictReader yields string values, so compare against '1'
    rows = [row for row in reader if row['flag_central'] == '1']
Here the condition is that if flag_central == 1, I need the row.
However, since the file is extremely large, I am not able to run the above code to completion. I believe the for loop over every row is what is causing the trouble.
Is there any way I can extract these rows from the CSV file based on the above condition?
Upvotes: 1
Views: 2687
Reputation: 10359
If this is a repetitive process, and/or you have more complex conditions to apply, here is a fast, low-memory approach in Python:
#!/usr/bin/env python
# put this in parsecsv.py, then chmod +x parsecsv.py
import sys

for line in sys.stdin:
    fields = line.split(',')
    # add your conditions below
    # call sys.stdout.write(line) to output a matching line
    if fields[0] == "foo":
        sys.stdout.write(line)
This is intended to be used as a pipeline filter from the command line:
$ cat file | ./parsecsv.py > extract.csv
Actually, I wrote a somewhat more generic and maintainable template that you might find useful.
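One caveat: line.split(',') will mis-split quoted fields that contain embedded commas. If your data uses quoting, a variant of the same stdin filter built on the csv module (just a sketch) is safer:
#!/usr/bin/env python
# csv-aware variant of the same filter; handles quoted fields correctly
import csv
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for fields in reader:
    # add your conditions below, as before
    if fields[0] == "foo":
        writer.writerow(fields)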
Upvotes: 1
Reputation: 10359
If this is a one-time task, I would suggest using Unix commands first, then processing the extract:
cat file | awk -F , '{ if ($5 == "1") print $0 }' > extract.csv
where -F specifies the column delimiter and $5 refers to the fifth column. Figure out the column number first with:
cat file | head -n 1 | tr ',' '\n' | nl | grep flag_central
=>
5 flag_central
^ this is the field number ($5)
This way you avoid the cost of first converting the CSV file into Python objects. Depending on your use case, YMMV.
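If you also want to keep the header row in the extract, a small variant works (assuming row 1 is the header):
awk -F , 'NR == 1 || $5 == "1"' file > extract.csv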
Upvotes: 3
Reputation: 16099
You could use Pandas. The only caveat is that, with a file this large, you will need to import it in chunks.
import pandas as pd
tp = pd.read_csv('/home/Documents/1681.csv', iterator=True, chunksize=10000)
df = pd.concat(tp, ignore_index=True)
From there you can extract the rows you are interested in with:
rows = df[df['flag_central'] == 1]
If you would like to write this back out to a CSV file, you can use to_csv:
rows.to_csv('filename.csv')
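Note that with 500 million rows, even the concatenated dataframe may not fit in memory. One mitigation, sketched below, is to load only the columns you actually need (usecols is a standard read_csv parameter; 'col_a' and 'col_b' are placeholder names for whatever columns you want back):
import pandas as pd

# placeholder column names; replace with the columns you actually need
cols = ['flag_central', 'col_a', 'col_b']
tp = pd.read_csv('/home/Documents/1681.csv', iterator=True,
                 chunksize=10000, usecols=cols)
df = pd.concat(tp, ignore_index=True)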
Upvotes: 1
Reputation: 394159
You can do this using pandas:
import pandas as pd

chunk_list = []
for chunk in pd.read_csv('/home/Documents/1681.csv', chunksize=10000):
    # keep only the rows in this chunk that meet the condition
    chunk_list.append(chunk[chunk['flag_central'] == 1])
final_df = pd.concat(chunk_list)
Basically, this reads 10000 rows at a time and filters out the rows that don't meet your condition. The matching rows are appended to a list and, when the read completes, the chunks are concatenated into a final dataframe.
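If even the filtered result is large, you can skip the intermediate list entirely and append each filtered chunk straight to the output file (a sketch; to_csv's mode and header parameters are standard pandas):
import pandas as pd

# stream matches straight to disk instead of collecting them in memory
first = True
for chunk in pd.read_csv('/home/Documents/1681.csv', chunksize=10000):
    matches = chunk[chunk['flag_central'] == 1]
    # write the header only for the first chunk, then append
    matches.to_csv('extract.csv', mode='w' if first else 'a',
                   header=first, index=False)
    first = False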
Upvotes: 3