ModalBro

Reputation: 554

Using Python to parse a 12GB CSV

I have a 12 GB CSV file. I want to extract only some columns from this data and then write a new CSV that I can load into R for analysis.

The problem is that I'm getting a memory error when trying to load the entire list at once before writing the new CSV file. How can I parse the data row by row and then create a CSV output?

Here is what I have so far:

import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']

DATA = pandas.read_csv(r'pathname\filename.csv', names=colnames)
DATA.to_csv(r'pathname\filename.csv', columns=colnames)

Upvotes: 5

Views: 2015

Answers (7)

ping

Reputation: 1336

You don't need Python or any extra R packages to do this. You can use the colClasses argument when reading data into R to read in only certain columns. Give a column a colClasses value of "NULL" to prevent it from being loaded during input.

For example, for a file ("myfile.csv") with 3 columns, where the first should be treated as character, the third as integer, and the second should be left out:

read.csv("myfile.csv", colClasses=c("character", "NULL", "integer"))

This method requires you to specify classes for every column in your file though.

It might also be worth reading the advice on memory usage at http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you have many rows.

Upvotes: 2

Rahul Premraj

Reputation: 1595

You could achieve this using R and JDBC. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then set up the driver.

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file

Now you can compose a query to select the columns you need and, if needed, add a where clause to filter the data. For example, to select only the rows where gear takes the value 5:

> dbGetQuery(conn, "select mpg, cyl, gear from mtcars where gear = '5'")
   mpg cyl gear
1   26   4    5
2 30.4   4    5
3 15.8   8    5
4 19.7   6    5
5   15   8    5

Upvotes: 0

GSee

Reputation: 49820

In R, you can use the fread function from the popular data.table package.

You can use the drop= argument to specify columns that should not be read -- no memory is allocated for them, and they are not read at all. Or use select= to name the columns you want to keep, if that is more convenient. fread can read csv files very, very quickly.

If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.


Alternatively, ?read.csv.sql from the sqldf package says it will

Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.

Here's the example:

write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv", 
                      sql="select * from file where Species = 'setosa' ")

Upvotes: 8

Daniel

Reputation: 42778

If you have a function filter_row that returns True when you want to keep a row, you can use:

with open('input.csv') as inp:
    with open('output.csv', 'w') as outp:
        for line in inp:
            if filter_row(line):
                outp.write(line)
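
What filter_row looks like depends on your data; as a hypothetical sketch (the column position and the threshold here are invented for illustration), it might be:

def filter_row(line):
    # Keep rows whose last column ("amount" in the question) parses as a
    # number greater than 100; the header and any non-numeric rows are kept.
    fields = line.rstrip("\n").split(",")
    try:
        return float(fields[-1]) > 100
    except ValueError:
        return True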

Upvotes: 1

RydallCooper

Reputation: 21234

Try this:

    file = open("csv.csv", "r")
    for line in file: 
        print(line)

It won't load the whole file into memory, it's parsing line by line.

Upvotes: -1

Bort

Reputation: 2491

Reading all the data into memory first is not a good idea.

There are two ways to deal with such large data files:

  1. Stream each line, pick out the relevant information and discard the rest until EOF, much like a streaming XML parser does (see the sketch after this list).

  2. Stream each line and write the entire data into a much better format that lets you easily access the relevant information, e.g. PyTables.
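
A minimal sketch of the first approach, using pandas in chunks since that is what the question already uses (the column subset, chunk size and file names are placeholders, not from the original post): read a manageable slice of rows at a time, keep only the wanted columns, and append the result to the output file.

import pandas as pd

cols = ['contributor name', 'recipient name', 'amount']  # hypothetical subset

first_chunk = True
for chunk in pd.read_csv('input.csv', usecols=cols, chunksize=100000):
    # Write the header only for the first chunk, then append the rest.
    chunk.to_csv('output.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False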

Upvotes: 1

moooeeeep

Reputation: 32542

You could use the csv module to process the file line-by-line. Something like this might work:

import csv
infname = "csv.csv"
outfname = "csv_stripped.csv"
cols = ["col1", "col2", "col3"]
with open(infname) as inf, open(outfname, 'w+') as outf:
    reader = csv.DictReader(inf)
    writer = csv.DictWriter(outf, cols, extrasaction='ignore')
    writer.writeheader()
    for line in reader:
        writer.writerow(line)

For reference: the csv module documentation at https://docs.python.org/3/library/csv.html

Upvotes: 5
