song

Reputation: 11

How to perform a keyword (non-exact) search on all columns of a data frame in R

I have exported old files from a legacy system. There isn't any documentation, and I have to search for data in a data frame/matrix that has more than 300 columns.

For example, using the following representative data:

a <- c("jan", "mar", "jan", "feb", "feb")
b <- c("feb", "mar", "mar", "january", "mar")
c <- c("jan", "feb", "feb", "jan", "jan")
d <- c("jan", "mar", "jan", "february", "feb")
e <- c("feb", "jan", "feb", "march", "mar")
f <- c("january", "february", "feb", "jan", "janet")
xxx <- data.frame(a,b,c,d,e,f) 
xxx 

I need to be able to search for "Jan" so that every data element containing it, such as "Jan", "January", or "Janet", shows up.

I tried using

which(xxx =="Jan", arr.ind=TRUE) 

but it only gives me exact matches.

Is there a way to wildcard the above, or another way to implement a search function on a big data set that I am trying to make sense of?

Upvotes: 1

Views: 445

Answers (2)

kangaroo_cliff

Reputation: 6222

which(sapply(xxx, function(x) grepl(pattern = "jan", x = x)), arr.ind=TRUE)

#       row col
# [1,]   1   1
# [2,]   3   1
# [3,]   4   2
# [4,]   1   3
# [5,]   4   3
# [6,]   5   3
# [7,]   1   4
# [8,]   3   4
# [9,]   2   5
#[10,]   1   6
#[11,]   4   6
#[12,]   5   6
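
Since the question searches for "Jan" with a capital J while the data is lower case, a case-insensitive variant may be closer to what is needed. The following is a minimal sketch, assuming the sample data from the question; the names `hits`, `pos`, and `found` are illustrative only. It uses the `ignore.case` argument of `grepl` and also attaches the column name and matched value to each position.

```r
# Same sample data as in the question
xxx <- data.frame(
  a = c("jan", "mar", "jan", "feb", "feb"),
  b = c("feb", "mar", "mar", "january", "mar"),
  c = c("jan", "feb", "feb", "jan", "jan"),
  d = c("jan", "mar", "jan", "february", "feb"),
  e = c("feb", "jan", "feb", "march", "mar"),
  f = c("january", "february", "feb", "jan", "janet"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell contains the keyword, ignoring case
hits <- sapply(xxx, function(x) grepl("Jan", x, ignore.case = TRUE))

# Row/column positions of every match
pos <- which(hits, arr.ind = TRUE)

# Attach the column name and the matched value to each position
found <- data.frame(
  row    = pos[, "row"],
  column = colnames(xxx)[pos[, "col"]],
  value  = as.matrix(xxx)[pos],
  stringsAsFactors = FALSE
)
found
```

With 300 columns, having the column name next to each hit is usually easier to read than a bare column index.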

Upvotes: 1

www

Reputation: 39174

I'm not sure what your desired output is, but the following code returns a list of the matching words from each column.

lapply(xxx, function(col) grep(pattern = "jan", x = col, value = TRUE))
# $a
# [1] "jan" "jan"
# 
# $b
# [1] "january"
# 
# $c
# [1] "jan" "jan" "jan"
# 
# $d
# [1] "jan" "jan"
# 
# $e
# [1] "jan"
# 
# $f
# [1] "january" "jan"     "janet" 
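
With more than 300 columns, most list entries will likely be empty (`character(0)`) for a rare keyword. One possible way to keep only the columns that contain a match is to wrap the call in `Filter`. This is a sketch, assuming the sample data from the question; the name `matches` is illustrative only.

```r
# Same sample data as in the question
xxx <- data.frame(
  a = c("jan", "mar", "jan", "feb", "feb"),
  b = c("feb", "mar", "mar", "january", "mar"),
  c = c("jan", "feb", "feb", "jan", "jan"),
  d = c("jan", "mar", "jan", "february", "feb"),
  e = c("feb", "jan", "feb", "march", "mar"),
  f = c("january", "february", "feb", "jan", "janet"),
  stringsAsFactors = FALSE
)

# Matching words per column, as in the answer above
all_matches <- lapply(xxx, function(col) grep(pattern = "janet", x = col, value = TRUE))

# Keep only the columns with at least one match (length > 0)
matches <- Filter(length, all_matches)
matches
```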

Without value = TRUE, the same code returns the indices of the matching words.

lapply(xxx, function(col) grep(pattern = "jan", x = col))
# $a
# [1] 1 3
# 
# $b
# [1] 4
# 
# $c
# [1] 1 4 5
# 
# $d
# [1] 1 3
# 
# $e
# [1] 2
# 
# $f
# [1] 1 4 5

If you replace grep with grepl, the code returns a list of logical vectors showing which words matched.

lapply(xxx, function(col) grepl(pattern = "jan", x = col))
# $a
# [1]  TRUE FALSE  TRUE FALSE FALSE
# 
# $b
# [1] FALSE FALSE FALSE  TRUE FALSE
# 
# $c
# [1]  TRUE FALSE FALSE  TRUE  TRUE
# 
# $d
# [1]  TRUE FALSE  TRUE FALSE FALSE
# 
# $e
# [1] FALSE  TRUE FALSE FALSE FALSE
# 
# $f
# [1]  TRUE FALSE FALSE  TRUE  TRUE
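
The per-column logical vectors can also be combined to pull out whole rows in which any column matches. The following is a rough sketch, assuming the sample data from the question; `keyword_rows` is a hypothetical helper written for this illustration, not an existing function.

```r
# Same sample data as in the question
xxx <- data.frame(
  a = c("jan", "mar", "jan", "feb", "feb"),
  b = c("feb", "mar", "mar", "january", "mar"),
  c = c("jan", "feb", "feb", "jan", "jan"),
  d = c("jan", "mar", "jan", "february", "feb"),
  e = c("feb", "jan", "feb", "march", "mar"),
  f = c("january", "february", "feb", "jan", "janet"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row where any column contains the keyword
keyword_rows <- function(df, keyword) {
  # One logical vector per column, TRUE where the cell contains the keyword
  hits <- lapply(df, function(col) grepl(keyword, col, ignore.case = TRUE))
  # Element-wise OR across columns
  Reduce(`|`, hits)
}

# Rows that mention "march" in any column
xxx[keyword_rows(xxx, "march"), ]
```

Note that `grepl` treats the keyword as a regular expression, so characters such as `.` or `(` in a keyword have special meaning unless they are escaped.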

Upvotes: 2
