Reputation: 704
Please help me out with correct syntax for filtering data on reading very large csv with data.table::fread in MSWindows.
For instance, I would like to keep only rows where the column sex == "T" for total, keep dates in the range of 2020-01-31 to 2020-07-01.
I came across the syntax like this but it is allegedly working only in Linux. Is there a Windows command line equivalent?:
mdata2 <-
data.table::fread(
cmd = "awk '$2 == \"T\" {print {$1}' data/demo2021-01-31_14_30_17.csv",
nThread = 8
)
What would be the correct syntax to filter the specific columns, eliminate some, use operators ==, >,< or filter values within a range of values or dates or from a specific set of values? Is there a tutorial you can recommend?
EDIT1:
The following syntax seems to work on 2nd column:
mdata2 <-
data.table::fread(
cmd = paste('awk -F, \'$2 == "T" \'', "data/demo2021-01-30_23_59_00.csv")
)
Now if I try to expand with another condition, it will not work.
mdata2 <-
data.table::fread(
cmd = paste('awk -F, \'$2 == "T" && $3 == "TOTAL" \'', "data/demo2021-01-30_23_59_00.csv")
)
Trying two conditions return error. It is very confusing.
'(awk , '$2 == "T" && $3 == "TOTAL" ' data/demo2021-01-30_23_59_00.csv) > C:\Users\jkotows2\AppData\Local\Temp\18\Rtmp4iwcUc\file379051a52dfd' execution failed with error code 1File 'C:\Users\jkotows2\AppData\Local\Temp\18\Rtmp4iwcUc\file379051a52dfd' has size 0. Returning a NULL data.table.
EDIT2: After I received a supperb answer from @r2evans I tried the following syntax and it works:
mdata2 <-
data.table::fread(cmd='awk "{ if (NR == 1 || ($2 == "T" && $10 == "TOTAL")) print }" data/demo2021-01-30_23_59_00.csv')
I will be wondering about more examples like getting only selected columns or filtering on a range of dates or a list of strings or regex. It is very interesting.
Upvotes: 0
Views: 768
Reputation: 160607
If you're running on windows, you should have Rtools or Rtools40 installed. If that's the case, then you have awk
available. (If not, then install it: https://cran.r-project.org/bin/windows/Rtools/. It's a heavy lift for just awk
, sure, but if you haven't need it yet to install some packages, the chances are good that you will need it at some point.)
You can verify that it is available with
Sys.which("awk")
# awk
# "c:\\Rtools40\\usr\\bin\\awk.exe"
If instead you get ""
, then you need to find it. Look for the base directory where you have Rtools installed (for me, it's C:/Rtools40/
), then set the path manually:
# set up a general file for awk testing:
write.table(mtcars, file="mt.tab", sep="\t", row.names=FALSE)
### full path to awk
fread(cmd="c:/Rtools40/usr/bin/awk '{print $1}' mt.tab")[1:3,]
# mpg
# <num>
# 1: 21.0
# 2: 21.0
# 3: 22.8
(I'm not suggesting you use this awk code, it's just a space-filler to show that I'm using awk
to somehow process a file that has 11 columns.)
For doing better filtering, something like this works:
data.table::fread(cmd='awk "{ if (NR == 1 || ($2 == 4 && $10 == 4)) print }" mt.tab')
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <int> <num> <int> <num> <num> <num> <int> <int> <int> <int>
# 1: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 2: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 3: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 4: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 5: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 6: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 7: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
NR == 1
is there to ensure we get the header row; if you don't have a header row, you can remove that portion of the conditional statementcyl
($2
) of 4 and gear
($10
) of 4For clarity, the double quotes are important to how fread
is calling system
around the cmd
: when I run it with doubles/singles reversed, it fails:
data.table::fread(cmd="awk '{ if (NR == 1 || ($2 == 4 && $10 == 4)) print }' mt.tab")
# print was unexpected at this time.
# Warning in (if (.Platform$OS.type == "unix") system else shell)(paste0("(", :
# '(awk '{ if (NR == 1 || ($2 == 4 && $10 == 4)) print }' mt.tab) > C:\Users\r2\AppData\Local\Temp\RtmpMVUOli\filed3067f77a57' execution failed with error code 1
# Error in data.table::fread(cmd = "awk '{ if (NR == 1 || ($2 == 4 && $10 == 4)) print }' mt.tab") :
# File 'C:\Users\r2\AppData\Local\Temp\RtmpMVUOli\filed3067f77a57' does not exist or is non-readable. getwd()=='C:/Users/r2/Downloads'
R's system
(and therefore system2
and shell
) are horrible, so it doesn't surprise me wholly that the cmd=
is sensitive to quotes and spaces.
Upvotes: 3