syre

Reputation: 982

List patterns by string in which they are found, in R

Based on this answer, how can we list the results in a more compact single column, for the case where we match many patterns but expect only a few hits per string? (I am not sure what the most orthodox format for the "hits" column is: a vector as below, or a delimited string.)

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)

#The original solution and output
sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))
#                   berber   weg
#berberichweg        TRUE  TRUE
#otto-klemperer-weg  FALSE TRUE
#feldmeierbogen      FALSE FALSE
#altostraße          FALSE FALSE

#The desired output instead
#streets            hits
#berberichweg       c("berber", "weg")
#otto-klemperer-weg "weg"
#feldmeierbogen     NA
#altostraße         NA

Upvotes: 0

Views: 53

Answers (1)

r2evans

Reputation: 160607

res <- sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))
res
#                    berber   weg
# berberichweg         TRUE  TRUE
# otto-klemperer-weg  FALSE  TRUE
# feldmeierbogen      FALSE FALSE
# altostraße          FALSE FALSE
dat <- data.frame(streets = streets)
dat$hits1 <- names[apply(res, 1, function(z) if (any(z)) which.max(z) else NA)]
dat
#              streets  hits1
# 1       berberichweg berber
# 2 otto-klemperer-weg    weg
# 3     feldmeierbogen   <NA>
# 4         altostraße   <NA>
dat$hits1
# [1] "berber" "weg"    NA       NA      

Note that hits1 keeps only the first matching pattern for each street. If instead you want all matches as one string per street, perhaps

dat$hits2 <- apply(res, 1, function(z) toString(names(which(z))))
dat
#              streets  hits1       hits2
# 1       berberichweg berber berber, weg
# 2 otto-klemperer-weg    weg         weg
# 3     feldmeierbogen   <NA>            
# 4         altostraße   <NA>            
dat$hits2
# [1] "berber, weg" "weg"         ""            ""           

Note that the first element is a single comma-delimited string, not a vector of strings, and that streets without a match get an empty string rather than NA. An alternative would be to use a list-column instead,

dat$hits3 <- apply(res, 1, function(z) names(which(z)))
dat
#              streets  hits1       hits2       hits3
# 1       berberichweg berber berber, weg berber, weg
# 2 otto-klemperer-weg    weg         weg         weg
# 3     feldmeierbogen   <NA>                        
# 4         altostraße   <NA>                        
dat$hits3
# $berberichweg
# [1] "berber" "weg"   
# $`otto-klemperer-weg`
# [1] "weg"
# $feldmeierbogen
# character(0)
# $altostraße
# character(0)
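One caveat with apply() here: when every row returns a result of the same length (for instance, exactly one hit per street), apply() simplifies to an atomic vector or a matrix instead of a list. A minimal sketch of a variant that always yields a list, assuming the same inputs as in the question (`pats` is a hypothetical name used here so the base function names() is not shadowed):

```r
# Same inputs as in the question
streets <- tolower(c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße"))
pats <- tolower(c("Berber", "Weg"))   # hypothetical name; the question calls this `names`

# Logical matrix: one row per street, one column per pattern
res <- sapply(pats, function(p) grepl(p, streets))
rownames(res) <- streets

# lapply() always returns a list, however many hits each row has
hits <- lapply(seq_len(nrow(res)), function(i) colnames(res)[res[i, ]])
names(hits) <- rownames(res)

dat <- data.frame(streets = streets)
dat$hits3 <- hits
```

Either way, the result holds one character vector per street, and streets without a match get character(0).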

hits3 is a list with one character vector per street, which can be assigned into a frame as a list-column. Two things to note about this:

  1. You'll need to use [[ to grab a single "cell" from this hits3:

    dat$hits1[1]
    # [1] "berber"
    dat$hits2[1]
    # [1] "berber, weg"
    dat$hits3[1]
    # $berberichweg             # <---- this is a list of length 1, not a character vector
    # [1] "berber" "weg"   
    dat$hits3[[1]]
    # [1] "berber" "weg"   
    
  2. Anything that works on this column will need to be list-friendly, since it is a list rather than an atomic vector.
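As a rough illustration of what "list-friendly" can look like, here is a small sketch using only base R. The frame is rebuilt by hand so the block runs on its own; `n_hits`, `has_weg` and `hits_chr` are hypothetical names chosen for this example.

```r
# Rebuild a small frame with a list-column shaped like hits3
dat <- data.frame(streets = c("berberichweg", "otto-klemperer-weg",
                              "feldmeierbogen", "altostraße"))
dat$hits3 <- list(c("berber", "weg"), "weg", character(0), character(0))

# Number of hits per street
n_hits <- lengths(dat$hits3)
# 2 1 0 0

# Does each street match a given pattern?
has_weg <- vapply(dat$hits3, function(h) "weg" %in% h, logical(1))
# TRUE TRUE FALSE FALSE

# Collapse to one string per street, with NA where nothing matched
# (closer to the desired output in the question)
hits_chr <- vapply(dat$hits3,
                   function(h) if (length(h)) toString(h) else NA_character_,
                   character(1))
# "berber, weg" "weg" NA NA
```

vapply() is used rather than sapply() so that the return type is fixed even when the list-column is empty.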

Upvotes: 3
