sabsirro
sabsirro

Reputation: 105

How to grep a word exactly

I'd like to grep for "nitrogen" in the following character vector and want to get back only the entry which is containing "nitrogen" and nothing of the rest (e.g. nitrogen fixation):

varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")

I tried something like this:

grepl(pattern= "![[:space:]]nitrogen![[:space:]]", varnames)

But this doesn't work.

Upvotes: 8

Views: 16934

Answers (3)

aL3xa
aL3xa

Reputation: 36080

Or use fixed = TRUE if you want to match actual string (regexlessly):

v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100, replace = TRUE, prob = c(.8, .1, .1))
grep("nitrogen", v, fixed = TRUE)
# [1]   3   4   5   6   7   8   9  11  12  13  14  16  19  20  21  22  23  24  25
# [20]  26  27  29  31  32  35  36  38  39  40  41  43  44  46  47  48  49  50  51
# [39]  52  53  54  56  57  60  61  62  65  66  67  69  70  71  72  73  74  75  76
# [58]  78  79  80  81  82  83  84  85  86  87  88  89  91  92  93  94  95  96  97
# [77]  98  99 100

Dunno about the speed issues, I like to test stuff and claim that approach A is faster than approach B, but in theory, at least from my experience, indexing/binary operators should be the fastest, so I vote for @Dason's approach. Also note that regexes are always slower than fixed = TRUE greping.

A little proof is attached bellow. Note that this is a lame test, and system.time should be put inside replicate to get (more) accurate differences, you should take outliers into an account, etc. But surely this one proves that you should use which! =)

(a0 <- system.time(replicate(1e5, grep("^nitrogen$", v))))
# user  system elapsed 
# 5.700   0.023   5.724  
(a1 <- system.time(replicate(1e5, grep("nitrogen", v, fixed = TRUE))))
# user  system elapsed 
# 1.147   0.020   1.168 
(a2 <- system.time(replicate(1e5, which(v == "nitrogen"))))
# user  system elapsed 
# 1.013   0.020   1.033 

Upvotes: 2

Dason
Dason

Reputation: 61933

To get the indices that are exactly equal to "nitrogen" you could use

which(varnames == "nitrogen")

Depending on what you want to do you might not even need the 'which' as varnames == "nitrogen" gives a logical vector of TRUE/FALSE. If you just want to do something like replace all of the occurances of "nitrogen" with "oxygen" this should suffice

varnames[varnames == "nitrogen"] <- "oxygen"

Upvotes: 13

thelatemail
thelatemail

Reputation: 93813

Although Dason's answer is easier, you could do an exact match using grep via:

varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")

grep("^nitrogen$",varnames,value=TRUE)
[1] "nitrogen"

grep("^nitrogen$",varnames)
[1] 1

Upvotes: 14

Related Questions