Srivatsan

Reputation: 9363

Extracting rows from an extremely large (48GB) CSV file based on condition

I have an extremely large CSV file which has more than 500 million rows.

But I only need a few thousand rows from it based on a certain condition. I am at the moment using:

import csv

with open('/home/Documents/1681.csv', 'r') as f:
    reader = csv.DictReader(f)
    # DictReader yields strings, so compare against '1'
    rows = [row for row in reader if row['flag_central'] == '1']

Here the condition is that if flag_central == 1, I need the row.

However, since the file is extremely large, I am not able to run the above code to completion. I believe it is the for loop over every row that is causing the trouble.

Is there any way I can extract these rows from the CSV file based on the above condition?

Upvotes: 1

Views: 2687

Answers (4)

miraculixx

Reputation: 10359

If this is a repetitive process and/or you have more complex conditions to apply, here is a fast, low-memory approach in Python that will get you there:

#!/usr/bin/env python
# put this in parsecsv.py, then chmod +x parsecsv.py
import sys
output = lambda l: sys.stdout.write(l)
for line in sys.stdin:
    fields = line.split(',')
    # add your conditions below
    # call output(line) to output
    if fields[0] == "foo":
         output(line)

This is intended to be used as a pipeline filter from the command line:

$ cat file | ./parsecsv.py > extract.csv
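Applied to the question's column, here is a minimal sketch of the same filter that looks up flag_central from the header row instead of hard-coding a field index; the script name filter_flag.py is made up for illustration, and it assumes a simple CSV with no quoted fields containing commas:

#!/usr/bin/env python
# put this in filter_flag.py, then chmod +x filter_flag.py
import sys

header = sys.stdin.readline()
columns = [c.strip() for c in header.split(',')]
idx = columns.index('flag_central')  # position of the flag column

sys.stdout.write(header)  # keep the header row in the extract
for line in sys.stdin:
    fields = line.split(',')
    if fields[idx].strip() == '1':
        sys.stdout.write(line)

Used the same way:

$ cat file | ./filter_flag.py > extract.csv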

Actually I wrote a somewhat more generic & maintainable template that you might find useful.

Upvotes: 1

miraculixx

Reputation: 10359

If this is a one-time task, I would suggest using unix commands first, then process the extract:

cat file | awk -F , '{ if ($5 == "1") print $0 }' > extract.csv

where -F specifies the column delimiter and $5 refers to the fifth column. Figure out the right column number first by running:

cat file | head -n 1 | tr ',' '\n' | nl | grep flag_central
=>
5   flag_central
^ this is the field number ($5)

This way you will not incur the cost of converting the CSV file into Python objects first. Depending on your use case, YMMV.

Upvotes: 3

Grr

Reputation: 16099

You could use pandas. The only caveat is that with such a large file you would need to import it in portions:

import pandas as pd
tp = pd.read_csv('/home/Documents/1681.csv', iterator=True, chunksize=10000)
df = pd.concat(tp, ignore_index=True)

From there you would then be able to extract the rows you are interested in with:

rows = df[df['flag_central'] == 1]

If you would like to write this back out to a CSV file, you could then use to_csv:

rows.to_csv('filename.csv')

Upvotes: 1

EdChum

Reputation: 394159

You can do this using pandas:

import pandas as pd

chunk_list = []
for chunk in pd.read_csv('/home/Documents/1681.csv', chunksize=10000):
    chunk_list.append(chunk[chunk['flag_central'] == 1])

final_df = pd.concat(chunk_list)

Basically this will read 10,000 rows at a time, keep only the rows that meet your condition, and append those to a list; when complete, the chunks are concatenated into a final DataFrame.
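If the goal is just to produce a smaller CSV rather than keep the result in memory, a minimal sketch of the same loop that writes each filtered chunk straight to disk; the extract.csv name and the header handling here are illustrative choices, not part of the answer above:

import pandas as pd

# stream the file in chunks and write matching rows straight to disk
with open('extract.csv', 'w') as out:
    for i, chunk in enumerate(pd.read_csv('/home/Documents/1681.csv', chunksize=10000)):
        matches = chunk[chunk['flag_central'] == 1]
        # only the first chunk writes the header row
        matches.to_csv(out, header=(i == 0), index=False)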

Upvotes: 3
