Reputation: 688
I want to read a large .txt file into R using the vroom
package, because is fast and supports pipe connections for pre-filtering.
For reproducibility, let's read this UK cats csv file from the Tidy Tuesday project and pre-filter for id == "Ares". The first column corresponds to the tag_id.
The following code returns an empty dataframe. How to fix the filter and what changes are required to filter by regular expressions instead of == "Ares"?
cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"
vroom(
file = pipe(paste("awk -F ',' '{ if ($1 == 'Ares') { print } }'", cats_file)),
delim = ","
)
Upvotes: 2
Views: 211
Reputation: 269421
1) The file linked to is not so large that you need to do anything special. Try this.
library(dplyr)
library(vroom)
Ares <- cats_file |>
vroom() |>
filter(tag_id == "Ares")
2) This also works
Ares <- cats_file |>
vroom() |>
filter(grepl("^Ares$", tag_id))
3) This works too and uses no packages:
Ares <- cats_file |>
read.csv() |>
subset(tag_id == "Ares")
4) Another approach if all the Ares rows are together (which it appears is the case) is to just read in the first column and then use that to pick out the rows needed.
tag_id <- read.csv(cats_file, colClasses = c(NA, rep("NULL", 10)))
rng <- range(grep("Ares", tag_id$tag_id))
Ares <- read.csv(cats_file, skip = rng[1] - 1, nrows = diff(rng) + 1)
An alternative if we knew that the tag_id field is first is to replace the tag_id
line above with
tag_id <- read_csv(cats_file, comment_char = ",")
Upvotes: 1
Reputation: 33914
Inside an awk
script literal string values need to be wrapped in double quotes, eg:
($1 == "Ares")
Single quotes are used to delimit awk
script/code; in this case awk
sees 3 chunks of script/code:
{ if ($1 ==
+Ares
+) { print }}
which awk
concatenates into:
{ if ($1 == Ares) { print } }
This translates into awk
comparing $1
to whatever's in the variable named Ares
which in this case is undefined (aka empty string) so $1 == <empty_string>
fails and nothing is printed.
I'm assuming you would need to escape the embedded double quotes, eg:
file = pipe(paste("awk -F ',' '{ if ($1 == \"Ares\") { print } }'", cats_file)),
^^ ^^
NOTE: I don't work with r/vroom
so I'm assuming the rest of OP's code should work once the awk
script is modified.
As Ed Morton has mentioned in comments the following should also work:
awk -F ',' '{ if ($1 ~ /Ares/) { print } }'
file = pipe(paste("awk -F ',' '{ if ($1 ~ /Ares/) { print } }'", cats_file)),
####
#### or
####
awk -F ',' '$1 ~ /Ares/'
file = pipe(paste("awk -F ',' '$1 ~ /Ares/`", cats_file)),
####
#### or
####
awk -F ',' '$1 == "Ares"'
file = pipe(paste("awk -F ',' '$1 == \"Ares\"'", cats_file)),
Upvotes: 2