Reputation: 554
I have a 12 GB CSV file. I'd like to extract only some columns from this data and then write a new CSV that I can load into R for analysis.
The problem is that I get a memory error when I try to load the entire file at once before writing the new CSV. How can I parse the data row by row and then create a CSV output?
Here is what I have so far:
import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']
DATA = pandas.read_csv(r'pathname\filename.csv', names=colnames)
DATA.to_csv(r'pathname\filename.csv', columns=colnames)
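For reference, pandas itself can keep memory bounded here: `read_csv` has `usecols` to read only certain columns and `chunksize` to stream the file in pieces. A minimal sketch (the helper name, paths, and the particular subset of columns are illustrative, not from the original post):

```python
import pandas

def extract_columns(src, dst, wanted, chunksize=100000):
    # The full column list of the (header-less) input file.
    colnames = ['contributor name', 'recipient name', 'recipient party',
                'contributor cfscore', 'candidate cfscore', 'amount']
    # Stream the CSV in chunks so at most `chunksize` rows are in
    # memory at a time; `usecols` keeps only the wanted columns.
    reader = pandas.read_csv(src, names=colnames, usecols=wanted,
                             chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Append each chunk to the output; write the header only once.
        chunk.to_csv(dst, mode='a', index=False, header=(i == 0))
```

Each chunk is an ordinary DataFrame, so any per-chunk filtering could be added inside the loop before writing.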
Upvotes: 5
Views: 2015
Reputation: 1336
You don't need Python or any extra R packages to do this. You can use the colClasses argument of R's data-input functions to read in only certain columns: give a column a colClasses value of "NULL" to prevent it from being loaded during input.
Example for a file ("myfile.csv") with 3 columns, where the first should be treated as character, the third as integer, and the second should be left out:
read.csv("myfile.csv", colClasses=c("character", "NULL", "integer"))
Note that this method requires you to specify a class for every column in your file.
If you have many rows, it is also worth reading the memory-usage advice at http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html.
Upvotes: 2
Reputation: 1595
You could achieve this using R and JDBC. First, create a sample CSV file:
write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file
Download and save the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then set up the driver:
> library(RJDBC)
> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file
Now you can compose a query that selects only the columns you need and, if required, add a where clause to filter the rows. For example, to select only rows where gear takes the value 5:
> dbGetQuery(conn, "select mpg, cyl, gear from mtcars where gear = '5'")
mpg cyl gear
1 26 4 5
2 30.4 4 5
3 15.8 8 5
4 19.7 6 5
5 15 8 5
Upvotes: 0
Reputation: 49820
In R, you can use the fread function from the popular data.table package.
You can use the drop= argument to specify columns not to be read: no memory is allocated for them, and they are not read at all. Or use select= to name the columns you want to keep, if that is more convenient. fread can read csv files very, very quickly.
If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.
Alternatively, ?read.csv.sql from the sqldf package says it will
Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.
Here's the example:
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql="select * from file where Species = 'setosa' ")
Upvotes: 8
Reputation: 42778
If you have a function filter_row which returns True when you want to keep a row, you can use:
with open('input.csv') as inp, open('output.csv', 'w') as outp:
    for line in inp:
        if filter_row(line):
            outp.write(line)
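For illustration, a hypothetical filter_row might parse the line and test one column. The column index and value below are assumptions, not something from the original question:

```python
import csv
import io

def filter_row(line, column=2, value="5"):
    # Parse this single CSV line properly (handles quoted fields)
    # and keep it only if the chosen 0-based column matches `value`.
    row = next(csv.reader(io.StringIO(line)))
    return len(row) > column and row[column].strip() == value
```

Because only one line is in memory at a time, this scales to files of any size.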
Upvotes: 1
Reputation: 21234
Try this:
with open("csv.csv", "r") as f:
    for line in f:
        print(line)
It won't load the whole file into memory; it parses the file line by line.
Upvotes: -1
Reputation: 2491
Reading all the data into memory first is not a good idea.
There are two ways to deal with such large data files:
Stream each line, pick out the relevant information, and discard the rest until EOF (compare how streaming XML parsers work).
Stream each line and write all of the data into a much better format that lets you easily access the relevant information later, e.g. PyTables.
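The first approach can be sketched with a generator (a minimal sketch; the function name and the default column names are illustrative):

```python
import csv

def stream_columns(path, wanted=("amount",)):
    # Stream the file row by row; only the selected fields of the
    # current row are held in memory at any point.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {k: row[k] for k in wanted}
```

The caller can iterate over the generator, writing or aggregating each filtered row, without the file ever being loaded whole.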
Upvotes: 1
Reputation: 32542
You could use the csv
module to process the file line-by-line. Something like this might work:
import csv
infname = "csv.csv"
outfname = "csv_stripped.csv"
cols = ["col1", "col2", "col3"]
with open(infname) as inf, open(outfname, 'w+') as outf:
    reader = csv.DictReader(inf)
    writer = csv.DictWriter(outf, cols, extrasaction='ignore')
    writer.writeheader()
    for line in reader:
        writer.writerow(line)
Upvotes: 5