dzegpi
dzegpi

Reputation: 688

Pre-filtering with pipe connections and vroom

I want to read a large .txt file into R using the vroom package, because is fast and supports pipe connections for pre-filtering.

For reproducibility, let's read this UK cats csv file from the Tidy Tuesday project and pre-filter for id == "Ares". The first column corresponds to the tag_id.

The following code returns an empty dataframe. How to fix the filter and what changes are required to filter by regular expressions instead of == "Ares"?

cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

vroom(
  file = pipe(paste("awk -F ',' '{ if ($1 == 'Ares') { print } }'", cats_file)),
  delim = ","
)

Upvotes: 2

Views: 211

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269421

1) The file linked to is not so large that you need to do anything special. Try this.

library(dplyr)
library(vroom)

Ares <- cats_file |>
  vroom() |>
  filter(tag_id == "Ares") 

2) This also works

Ares <- cats_file |>
  vroom() |>
  filter(grepl("^Ares$", tag_id))

3) This works too and uses no packages:

Ares <- cats_file |>
  read.csv() |>
  subset(tag_id == "Ares") 

4) Another approach if all the Ares rows are together (which it appears is the case) is to just read in the first column and then use that to pick out the rows needed.

tag_id <- read.csv(cats_file, colClasses = c(NA, rep("NULL", 10)))
rng <- range(grep("Ares", tag_id$tag_id))
Ares <- read.csv(cats_file, skip = rng[1] - 1, nrows = diff(rng) + 1)

An alternative if we knew that the tag_id field is first is to replace the tag_id line above with

tag_id <- read_csv(cats_file, comment_char = ",")

Upvotes: 1

markp-fuso
markp-fuso

Reputation: 33914

Inside an awk script literal string values need to be wrapped in double quotes, eg:

($1 == "Ares")

Single quotes are used to delimit awk script/code; in this case awk sees 3 chunks of script/code:

  • { if ($1 == +
  • Ares +
  • ) { print }}

which awk concatenates into:

{ if ($1 == Ares) { print } }

This translates into awk comparing $1 to whatever's in the variable named Ares which in this case is undefined (aka empty string) so $1 == <empty_string> fails and nothing is printed.

I'm assuming you would need to escape the embedded double quotes, eg:

file = pipe(paste("awk -F ',' '{ if ($1 == \"Ares\") { print } }'", cats_file)),
                                           ^^    ^^

NOTE: I don't work with r/vroom so I'm assuming the rest of OP's code should work once the awk script is modified.


As Ed Morton has mentioned in comments the following should also work:

awk -F ',' '{ if ($1 ~ /Ares/) { print } }'

file = pipe(paste("awk -F ',' '{ if ($1 ~ /Ares/) { print } }'", cats_file)),

####
#### or
####

awk -F ',' '$1 ~ /Ares/'

file = pipe(paste("awk -F ',' '$1 ~ /Ares/`", cats_file)),

####
#### or
####

awk -F ',' '$1 == "Ares"'

file = pipe(paste("awk -F ',' '$1 == \"Ares\"'", cats_file)),

Upvotes: 2

Related Questions