Reputation: 331

how to read huge csv file into R by row condition?

I have a huge csv file about 15 million row, with size around 3G.

I would like to read this file into R by piece, each time only choose those rows fit into certain condition.

e.g. one of the column is called product type, so I only need to read one type of product into R, and process it then output the result, after that I move to another type of product...

so far I have read about different methods, such as upload the big file into database, or read column by column by colbycol, or read a chunk of rows by ff ...

is any pure R solution can solve my problem?

Upvotes: 11

Answers (3)

Rahul Premraj

Reputation: 1595

You could also use JDBC to achieve this. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then setup the driver.

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file

Let's look at the top 3 rows in the mtcars dataset:

> head(dbGetQuery(conn, "select * from mtcars"), 3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

Next, let's see the distinct values column gear takes and the respective counts:

> dbGetQuery(conn, "select gear, count(*) from mtcars group by gear")
  GEAR COUNT(*)
1    4       12
2    3       15
3    5        5

Now you can compose a query using the where clause to filter on the data to only select rows where gear takes the value 5:

> dbGetQuery(conn, "select * from mtcars where gear = '5'")
   mpg cyl  disp  hp drat    wt qsec vs am gear carb
1   26   4 120.3  91 4.43  2.14 16.7  0  1    5    2
2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
3 15.8   8   351 264 4.22  3.17 14.5  0  1    5    4
4 19.7   6   145 175 3.62  2.77 15.5  0  1    5    6
5   15   8   301 335 3.54  3.57 14.6  0  1    5    8

Upvotes: 0

ROLO

Reputation: 4223

You can use the RSQLite package:

library(RSQLite)
# Create/Connect to a database
con <- dbConnect("SQLite", dbname = "sample_db.sqlite")

# read csv file into sql database
# Warning: this is going to take some time and disk space, 
#   as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name="sample_table", value="Your_Big_CSV_File.csv", 
    row.names=FALSE, header=TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data you can leave out the dbWriteTable, as the SQLite table is stored on disk.

Note: the writing of the CSV data to the SQLite file does not load all data in memory first. So the memory you will use in the end will be limited to the amount of data that your query returns.

Upvotes: 20

Greg Snow

Reputation: 49660

It is possible to do this only using R:

open a connection to the file
if there is a header, then read in the header information
read one line from the file using read.csv specifying colClasses and nrows=1
test that line to see if it meets your conditions, append it to the growing data frame if yes
repeat step 4 for the rest of the file.
close the connection

While the above is possible, I don't think that it is advisable. This type of thing is probably better done by loading the data into a database, then querying the database from within R.

Upvotes: 5

how to read huge csv file into R by row condition?

Answers (3)

Related Questions