user3491700

Reputation: 83

How to pull cells and rows from a very big .csv into R with SQL?

Is there a way to read only part of a csv using a SQL query and put the result into a dataframe? For example, can I put the result of an aggregate-function SQL query into a new dataframe? Normally I would read in the whole csv and modify/use it, but this file is too large for that.

Upvotes: 0

Views: 582

Answers (1)

Shawn Mehan

Reputation: 4578

If you read the sqldf documentation, you will find a function for exactly this:

read.csv.sql: Read File Filtered by SQL

Its help page says:

Description: Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.
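As a minimal sketch of what that description means in practice, the snippet below writes the built-in iris data to a temporary csv and reads back only the rows matching a WHERE clause. The name csv_path is made up for illustration, and a real use would point at the large file already on disk. The file is written with quote = FALSE on the assumption that the default import does not strip quotes from fields.

```r
library(sqldf)

# Stand-in for the large csv: write a small sample file to a temp location.
# csv_path is a hypothetical name; replace it with the path to the real file.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, quote = FALSE, row.names = FALSE)

# The file is loaded into a temporary SQLite database, the query runs there,
# and only the matching rows come back to R as a data frame.
# Inside the SQL statement the table is always called "file".
setosa <- read.csv.sql(csv_path,
                       sql = "select * from file where Species = 'setosa'")

head(setosa)
```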

Usage

read.csv.sql(file, sql = "select * from file", header = TRUE, sep = ",",
    row.names, eol, skip, filter, nrows, field.types,
    colClasses, dbname = tempfile(), drv = "SQLite", ...)

read.csv2.sql(file, sql = "select * from file", header = TRUE, sep = ";",
    row.names, eol, skip, filter, nrows, field.types,
    colClasses, dbname = tempfile(), drv = "SQLite", ...)

Arguments

  1. file: A file path or a URL (beginning with http:// or ftp://). If the filter argument is used and no file is to be input to the filter then file can be omitted, NULL, NA or "".
  2. sql: Character string holding an SQL statement. The table representing the file should be referred to as file.
  3. header: As in read.csv.
  4. sep: As in read.csv.
  5. row.names: As in read.csv.
  6. eol: Character which ends line.
  7. skip: Skip the indicated number of lines in the input file.
  8. filter: If specified, this should be a shell/batch command that the input file is piped through. For read.csv2.sql it is by default the following on non-Windows systems: tr , . This translates all commas in the file to dots. On Windows similar functionality is provided, using a vbscript file that is included with sqldf to emulate the tr command.
  9. nrows: Number of rows used to determine column types. It defaults to 50. Using -1 causes it to use all rows for determining column types. This argument is rarely needed.
  10. field.types: A list whose names are the column names and whose contents are the SQLite types (not the R class names) of the columns. Specifying these types improves speed, but unless speed is very important this argument is not normally used.
  11. colClasses: As in read.csv.
  12. dbname: As in sqldf except that the default is tempfile(). Specifying NULL will put the database in memory, which may improve speed but will limit the size of the database to the available memory.
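For the aggregate case asked about in the question, something along these lines should work. It is only a sketch: the column names region and amount, the sample values, and the name sales_path are invented for illustration, and the tiny file stands in for the real one. Because dbname is left at its default, the intermediate database lives in a temporary file on disk rather than in memory.

```r
library(sqldf)

# Hypothetical sample data standing in for the large csv.
sales <- data.frame(region = c("a", "a", "b"),
                    amount = c(10, 20, 5))
sales_path <- tempfile(fileext = ".csv")
write.csv(sales, sales_path, quote = FALSE, row.names = FALSE)

# The aggregation runs inside SQLite; R only receives one row per region.
totals <- read.csv.sql(
  sales_path,
  sql = "select region, sum(amount) as total, count(*) as n
         from file
         group by region
         order by region"
)

totals
```

The result is an ordinary data frame, so it can be passed straight to whatever code would have used the full csv. If the file uses ; as the separator and , as the decimal mark, read.csv2.sql takes the same arguments.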

Upvotes: 1
