Reputation: 616
To avoid loading a large dataset (> 7 million rows) into R I would like to pre-process it. The intention was to use either grep
or gawk
in fread()
from data.table
package.
Here is a hugely simplified example of the dataset (test.dat
);
0.678 324 765
8.001 109 425
4.425 765 124
3.463 425 765
I would like to match on ID columns 2 and 3 and return column 1. Have tried using grep
with no success really;
library(data.table)
# id <- 425
fread("grep 425 test.dat", header = FALSE)
# V1 V2 V3
# 1: 8.001 109 425
# 2: 4.425 765 124 # <-- incorrect match
# 3: 3.463 425 765
And no, -w
does not help me as suggested on different places on the web as the decimal is considered as punctuation marking end of a word.
Then I tried awk
or in fact gawk
as I am on a Windows 10 machine;
fread("gawk '$2==425 {print $1}' test.dat", header = FALSE)
# V1
# 1: 3.463
Worked for matching on second column, so I extended to match on either column 2 and 3;
fread("gawk '$2==425||$3==425 {print $1}' test.dat", header = FALSE)
# Error in fread("gawk '$2==425||$3==425 {print $1}' test.dat" :
# File is empty: ...
Other attempts;
fread("gawk '$2==\"425\"||$3==\"425\" {print $1}' test.dat", header = FALSE)
# Error in fread("gawk '$2==\"425\"||$3==\"425\" {print $1}' test.dat", :
# File is empty: ...
fread("gawk '{ if ($2==425 || $3==425) print $1; }' test.dat")
# print was unexpected at this time.
# Error in fread("gawk '{ if ($2==425 || $3==425) print $1; }' test.dat", :
# File not found: ...
# In addition: Warning messages:
# 1: running command ... had status 1
# 2: In shell(paste("(", input, ") > ", tt, sep = "")) :
# '(gawk '{ if ($2==425 || $3==425) print $1; }' test.dat) > ..' execution failed with error code 1
I tested the attempts on a linux machine which did give me the desired result. I have been searching and trying different solutions, but to no avail. Anyone got a suggestion?
Here is my sessionInfo()
if needed;
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4 RevoUtilsMath_10.0.0
loaded via a namespace (and not attached):
[1] compiler_3.4.0 RevoUtils_10.0.4 tools_3.4.0
Upvotes: 2
Views: 932
Reputation: 171
supported on windows:
fread('grep -v "^[04]." test.dat', header = T)
# V1 V2 V3
#1: 8.001 109 425
#2: 3.463 425 765
fread('grep "[^.]425" test.dat', header = F)
# V1 V2 V3
#1: 8.001 109 425
#2: 3.463 425 765
Upvotes: 0
Reputation: 6277
I am not sure if it is supported on windows, but here goes a solution that works on linux with grep:
fread("grep ' 425 \\| 425$' test.dat", header = F)
Here is how it work:
the first pattern matches ' 425 '
, which checks for the number of interest enclosed by space (so, it won't match 4.425
or 14251
)
the second pattern matches ' 425[end of line]'
, because the first pattern would not catch that
the two patterns are concatenated using the |
operator, which is doubly escaped \
(the \
character needing to be escaped itself)
In case the csv file is comma separated, this should work:
fread("grep ',425,\\|,425$' test.dat", header = F)
Upvotes: 2