Reputation: 173

string match: words + characters

I'm trying to search a dataframe for matching strings. I made an object, lc_notes, from a column filled with notes.

As an example:

I'm looking for any row with notes that might match

mph_words<-c(">10", "> 10", ">20", "> 20")

And a row of notes may resemble:

> lc_notes[1703]
[1] "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph."

As you can see, some of the notes have spaces between "<" or ">" and the number so using strsplit to search wouldn't be ideal because I do need to keep the "<"/">" with the number.

I've tried

> mph_words %in% lc_notes[2000]
[1] FALSE FALSE FALSE FALSE

> pmatch(mph_words, lc_notes[1703])
[1] NA NA NA NA

> grepl(lc_notes[1703],mph_words)
[1] FALSE FALSE FALSE FALSE

> str_detect(mph_words,lc_notes[1703])
[1] FALSE FALSE FALSE FALSE

> for (word in 1:length(mph_words)){
+   print(str_extract(mph_words[word],lc_notes[1703]))
+ }
[1] NA
[1] NA
[1] NA
[1] NA

and I'm not sure what to try next. If the answer is a regex, could you possibly explain it in your answer? I'm trying to understand regex better.

Edit: I'm trying to print out rows that specifically contain one of the strings in mph_words. So, the code would search each row in my lc_notes and print row 1703.

Thank you in advance!

Upvotes: 1

Answers (3)

niko

Reputation: 5281

Here is a way using strsplit and lapply

# standardize (get rid of white space between <,> and digits in mph_words)
mph_words <- unique(gsub('([<>])\\s{0,}(\\d+)', '\\1\\2', mph_words, perl = TRUE))
# match: split each note on spaces and test whether any token is in mph_words
check <- lapply(seq_along(lc_notes),
                function (k) any(mph_words %in% unlist(strsplit(lc_notes[k], ' '))))
check
# [[1]]
# [1] TRUE

# [[2]]
# [1] TRUE

# [[3]]
# [1] FALSE

# Finally printing the indices with a match
which(unlist(check))
# [1] 1 2

with the data

mph_words <- c(">10", "> 10", ">20", "> 20")  
lc_notes <- "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph."
lc_notes <- c(lc_notes, 'test >10', '>15')

Upvotes: 1

divibisan

Reputation: 12155

I would use sapply with stringr::str_detect for this:

lc_notes <- c("collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph.")
mph_words <- c(">10", "> 10", ">20", "> 20")

library(stringr)  # provides str_detect
sapply(lc_notes, function(x) any(str_detect(x, mph_words)))

collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph. 
                                                                    TRUE 
collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph. 
                                                                    TRUE 
collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph. 
                                                                   FALSE

sapply loops through each element of the lc_notes vector and applies the test to it. str_detect returns one logical value per pattern in mph_words, and any collapses those into a single TRUE or FALSE for that note.

If you want the row numbers, rather than a logical vector, use the which function:

unname(which(sapply(lc_notes, function(x) any(str_detect(x, mph_words)))))
[1] 1 2

I used unname here to make it clear that the result is a vector of indices of the items in lc_notes that match any of the regex patterns. You can also do the opposite and call names on it to get just the text of the matching rows:

names(which(sapply(lc_notes, function(x) any(str_detect(x, mph_words)))))
[1] "collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph." 
[2] "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph."

If you want a simpler regex that matches with or without the space, use the ? quantifier, which makes the preceding space character optional:

mph_words<-c("> ?10", "> ?20")
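As a rough sketch of how patterns like these could be applied without looping over each note, the alternatives can be joined with | into a single regex and passed to base R's grepl, which is already vectorised over its second argument. The variable names pattern and matches are introduced here for illustration only, and the sample notes are the same three used earlier in this answer.

```r
lc_notes <- c("collected 1.667 man-hr total. mostly cloudy, windy with gusts >20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts > 20 mph.",
              "collected 1.667 man-hr total. mostly cloudy, windy with gusts of 20 mph.")
mph_words <- c("> ?10", "> ?20")

# Join the alternatives into one regex: "> ?10|> ?20"
pattern <- paste(mph_words, collapse = "|")

# grepl(pattern, x) tests every element of x, so no sapply is needed
matches <- grepl(pattern, lc_notes)

# Indices of the notes that match
which(matches)
# [1] 1 2
```

Note that the pattern goes first and the text to search goes second in grepl; with stringr::str_detect the order is the other way round (string, then pattern).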

Upvotes: 2

G5W

Reputation: 37641

Edited to match edited question:
To find the row numbers, use grep

grep("[<>]\\s*\\d+\\b",  lc_notes)

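[<>] matches either < or >
\\s* allows optional whitespace
\\d+ matches one or more digits
\\b matches a word boundary, so the number must not be followed directly by a letter, digit, or underscore

grep returns the indices of the elements of lc_notes that match.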
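A small sketch of this regex on made-up data may help show what it does and does not catch. The notes vector below is hypothetical, and the names hits and found are introduced only for this example. Keep in mind that this pattern matches any number after < or >, not only 10 or 20.

```r
# Hypothetical sample notes, for illustration only
notes <- c("windy with gusts >20 mph.",
           "windy with gusts > 20 mph.",
           "wind < 5 mph, clear.",
           "gusts of 20 mph.")

pattern <- "[<>]\\s*\\d+\\b"

# Indices of the notes that contain a match
hits <- grep(pattern, notes)
hits
# [1] 1 2 3

# The matching rows themselves
notes[hits]

# The matched text from each matching note (first match per note)
found <- regmatches(notes, regexpr(pattern, notes))
found
# [1] ">20"  "> 20" "< 5"
```

The last note is skipped because it has a number but no < or > in front of it.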

Upvotes: 3
