Reputation: 331
I have a huge csv file about 15 million row, with size around 3G.
I would like to read this file into R by piece, each time only choose those rows fit into certain condition.
e.g. one of the column is called product type, so I only need to read one type of product into R, and process it then output the result, after that I move to another type of product...
so far I have read about different methods, such as upload the big file into database, or read column by column by colbycol, or read a chunk of rows by ff ...
is any pure R solution can solve my problem?
Upvotes: 11
Views: 9463
Reputation: 1595
You could also use JDBC to achieve this. Let's create a sample csv file.
write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file
Download and save the the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download and then setup the driver.
> library(RJDBC)
> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd())) # replace getwd() with location of csv file
Let's look at the top 3 rows in the mtcars dataset:
> head(dbGetQuery(conn, "select * from mtcars"), 3)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
Next, let's see the distinct values column gear takes and the respective counts:
> dbGetQuery(conn, "select gear, count(*) from mtcars group by gear")
GEAR COUNT(*)
1 4 12
2 3 15
3 5 5
Now you can compose a query using the where clause to filter on the data to only select rows where gear takes the value 5:
> dbGetQuery(conn, "select * from mtcars where gear = '5'")
mpg cyl disp hp drat wt qsec vs am gear carb
1 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
3 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
4 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
5 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Upvotes: 0
Reputation: 4223
You can use the RSQLite
package:
library(RSQLite)
# Create/Connect to a database
con <- dbConnect("SQLite", dbname = "sample_db.sqlite")
# read csv file into sql database
# Warning: this is going to take some time and disk space,
# as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name="sample_table", value="Your_Big_CSV_File.csv",
row.names=FALSE, header=TRUE, sep = ",")
# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")
dbDisconnect(con)
Next time you want to access your data you can leave out the dbWriteTable
, as the SQLite table is stored on disk.
Note: the writing of the CSV data to the SQLite file does not load all data in memory first. So the memory you will use in the end will be limited to the amount of data that your query returns.
Upvotes: 20
Reputation: 49640
It is possible to do this only using R:
read.csv
specifying colClasses
and nrows=1
While the above is possible, I don't think that it is advisable. This type of thing is probably better done by loading the data into a database, then querying the database from within R.
Upvotes: 5