Reputation: 173
I'm trying to search a data frame for rows whose notes match a string. I've pulled the notes column out into a character vector, lc_notes.
As an example:
I'm looking for any row with notes that might match
mph_words<-c(">10", "> 10", ">20", "> 20")
And a row of notes may resemble:
> lc_notes[1703]
[1] "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph."
As you can see, some of the notes have spaces between the "<" or ">" and the number, so using strsplit to search wouldn't be ideal: I need to keep the "<"/">" with the number.
I've tried
> mph_words %in% lc_notes[2000]
[1] FALSE FALSE FALSE FALSE
> pmatch(mph_words, lc_notes[1703])
[1] NA NA NA NA
> grepl(lc_notes[1703],mph_words)
[1] FALSE FALSE FALSE FALSE
> str_detect(mph_words,lc_notes[1703])
[1] FALSE FALSE FALSE FALSE
> for (word in 1:length(mph_words)){
+ print(str_extract(mph_words[word],lc_notes[1703]))
+ }
[1] NA
[1] NA
[1] NA
[1] NA
and I'm not sure what to try next. If it's a regex expression, could you possibly just explain it in your answer? I'm trying to understand regex better.
Edit: I'm trying to print the rows that contain one of the strings in mph_words. So, the code would search each row in my lc_notes and print row 1703.
Thank you in advance!
Upvotes: 1
Views: 51
Reputation: 5281
Here is a way using strsplit and lapply:
# standardize (get rid of white space between <, > and digits) in mph_words
mph_words <- unique(gsub('([<>])\\s{0,}(\\d+)', '\\1\\2', mph_words, perl = TRUE))
# match: split each note on spaces and check whether any token is in mph_words
check <- lapply(seq_along(lc_notes),
                function (k) any(mph_words %in% unlist(strsplit(lc_notes[k], ' '))))
check
# [[1]]
# [1] TRUE
# [[2]]
# [1] TRUE
# [[3]]
# [1] FALSE
# Finally printing the indices with a match
which(unlist(check))
# [1] 1 2
with the data
mph_words <- c(">10", "> 10", ">20", "> 20")
lc_notes <- "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph."
lc_notes <- c(lc_notes, 'test >10', '>15')
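Note that the code above standardizes only mph_words, so a note written as "> 20" is split into ">" and "20" and would not be matched. A minimal sketch of one way around that, assuming it is acceptable to apply the same substitution to the notes before splitting (the helper name squeeze and the shortened sample notes are made up for illustration):

```r
mph_words <- c(">10", "> 10", ">20", "> 20")
lc_notes <- c("windy with gusts >20 mph.",
              "windy with gusts > 20 mph.",
              "gusts of 20 mph.")

# Hypothetical helper: drop whitespace between < or > and the digits after it
squeeze <- function(x) gsub("([<>])\\s*(\\d+)", "\\1\\2", x)

words_std <- unique(squeeze(mph_words))  # ">10" ">20"
notes_std <- squeeze(lc_notes)           # "> 20" becomes ">20"

# Split each standardized note on spaces and test its tokens against the words
hits <- vapply(strsplit(notes_std, " ", fixed = TRUE),
               function(tokens) any(words_std %in% tokens),
               logical(1))
which(hits)
```

This still relies on the token being surrounded by spaces, so a note ending in ">20." would need the punctuation handled as well.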
Upvotes: 1
Reputation: 12155
I would use sapply with stringr::str_detect for this:
library(stringr)
lc_notes <- c("collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph.")
mph_words <- c(">10", "> 10", ">20", "> 20")
sapply(lc_notes, function(x) any(str_detect(x, mph_words)))
collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.
TRUE
collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.
TRUE
collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph.
FALSE
sapply loops through each element of the lc_notes vector, applying the test to each. str_detect returns one logical value per pattern in mph_words, and any collapses those into a single logical value for each note.
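To make the role of any concrete, here is a small sketch for a single note. It swaps in base R's grepl with fixed = TRUE for str_detect so that it runs without stringr; that substitution is my assumption, not part of the answer's code.

```r
note <- "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph."
mph_words <- c(">10", "> 10", ">20", "> 20")

# One logical per pattern: does this literal string occur in the note?
per_pattern <- vapply(mph_words,
                      function(p) grepl(p, note, fixed = TRUE),
                      logical(1))
per_pattern

# any() collapses the four per-pattern results into one value for the note
any(per_pattern)
```

Only the "> 20" pattern is found in this note, which is enough for any to return TRUE.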
If you want the row numbers, rather than a logical vector, use the which function:
unname(which(sapply(lc_notes, function(x) any(str_detect(x, mph_words)))))
[1] 1 2
I used unname here to highlight that the vector this returns holds the indices of the items in lc_notes that match any of the regex patterns. You can also do the opposite and call names on it to just get the text of the matching rows:
names(which(sapply(lc_notes, function(x) any(str_detect(x, mph_words)))))
[1] "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph."
[2] "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph."
If you want a simpler regex that matches with or without spaces, use the ? optional quantifier on the space character:
mph_words<-c("> ?10", "> ?20")
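As a rough sketch of how these two patterns could be used in one pass, they can be joined with | (alternation) and handed to base R's grep. Joining them this way is my assumption rather than part of the answer, and the sample notes are shortened for illustration.

```r
lc_notes <- c("windy with gusts >20 mph.",
              "windy with gusts > 20 mph.",
              "windy with gusts of 20 mph.")
mph_words <- c("> ?10", "> ?20")

# Build a single regex that matches if either pattern matches
pattern <- paste(mph_words, collapse = "|")

# grep() returns the indices of the notes that match
hits <- grep(pattern, lc_notes)
hits
```

Because these patterns have no boundary after the number, "> ?10" would also match the start of ">100".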
Upvotes: 2
Reputation: 37641
Edited to match edited question:
To find the row numbers, use grep
grep("[<>]\\s*\\d+\\b", lc_notes)
[<>] matches either < or >
\\s* allows optional whitespace
\\d+ matches the one or more digits that follow
\\b requires a word boundary after the last digit
grep returns the indices of the elements that match.
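The pattern above matches any number after < or >. If only the values from mph_words are wanted, a variant along these lines might work; the restriction to > with 10 or 20 and the sample notes are assumptions made for illustration.

```r
lc_notes <- c("gusts >20 mph.",
              "gusts > 20 mph.",
              "gusts of 20 mph.",
              "gusts >200 mph.",
              "gusts <10 mph.")

# ">" then optional whitespace, then 10 or 20, then a word boundary
pattern <- ">\\s*(10|20)\\b"

idx <- grep(pattern, lc_notes)                  # indices of matching notes
rows <- grep(pattern, lc_notes, value = TRUE)   # the matching notes themselves
idx
rows
```

The \\b keeps ">200" from counting as a match for ">20", and the bare ">" leaves out the "<10" note.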
Upvotes: 3