Reputation: 127

Finding partial match strings in any column in a dataframe in R

I have a dataframe:

vessel<-c(letters[1:4])
type<-c("Fishery Vessel","NA","NA","Cargo")
class<-c("NA","FISHING","NA","CARGO")
status<-c("NA", "NA", "Engaged in Fishing", "Underway")
df<-data.frame(vessel,type, class, status)

  vessel           type   class             status
1      a Fishery Vessel      NA                 NA
2      b             NA FISHING                 NA
3      c             NA      NA Engaged in Fishing
4      d          Cargo   CARGO           Underway

I would like to subset df to contain only the rows relating to fishing (i.e. rows 1:3), which to me means doing something like:

df.sub<-subset(grep("FISH", df) | grep("Fish", df))

But this doesn't work. I've been trying apply (as in this question) and partial string matching with grep (as in this question), but I can't seem to pull it all together.

Grateful for any help. My data has tens of columns and up to a million rows, so I'm trying my best to avoid loops if possible, but maybe that's the only way? Thanks!

Upvotes: 1

Answers (3)

akrun

Reputation: 887881

In base R, we can use a vectorized option with grepl and Reduce:

subset(df, Reduce(`|`, lapply(df[-1], grepl, pattern = 'fish', ignore.case = TRUE)))
#  vessel           type   class             status
#1      a Fishery Vessel      NA                 NA
#2      b             NA FISHING                 NA
#3      c             NA      NA Engaged in Fishing
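To see what the one-liner is doing, it can be unpacked into separate steps. This is only an illustrative sketch: the intermediate names `hits` and `keep` are made up here and are not part of the answer above.

```r
# Example data from the question (the "NA" entries are literal strings)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Step 1: one logical vector per column (vessel is excluded via df[-1])
hits <- lapply(df[-1], grepl, pattern = "fish", ignore.case = TRUE)

# Step 2: combine the per-column vectors element-wise with OR
keep <- Reduce(`|`, hits)

# Step 3: keep the rows where at least one column matched
df.sub <- df[keep, ]
```

Each column is scanned once with a single vectorized grepl() call, so there is no explicit loop over rows.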

Upvotes: 0

Mike V

Reputation: 1364

Another option you can try:

library(dplyr)
library(stringr)
df %>%
  filter_all(any_vars(str_detect(., regex("fish", ignore_case = TRUE))))
#   vessel           type   class             status
# 1      a Fishery Vessel      NA                 NA
# 2      b             NA FISHING                 NA
# 3      c             NA      NA Engaged in Fishing
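In recent dplyr releases, filter_all() and any_vars() are superseded. Assuming dplyr 1.0.4 or later is installed, roughly the same filter could be written with if_any(). This sketch swaps in base grepl() for str_detect() so that stringr is not needed; that substitution is a choice made here, not something from the answer above.

```r
library(dplyr)

# Example data from the question (the "NA" entries are literal strings)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# Keep rows where any column contains "fish", ignoring case
df.sub <- df %>%
  filter(if_any(everything(), ~ grepl("fish", .x, ignore.case = TRUE)))
```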

Upvotes: 1

Duck

Reputation: 39613

If you want to use apply(), you could compute an index based on your string fish and then subset. The index is the number of values in each row that match fish, counted with grepl(). You can set ignore.case = TRUE so that matching ignores upper and lower case. When the index is greater than or equal to 1, at least one match occurred, so the row is kept in the subset. Here is the code:

# Data
vessel <- letters[1:4]
type <- c("Fishery Vessel", "NA", "NA", "Cargo")
class <- c("NA", "FISHING", "NA", "CARGO")
status <- c("NA", "NA", "Engaged in Fishing", "Underway")
df <- data.frame(vessel, type, class, status, stringsAsFactors = FALSE)
# Create an index with apply: the number of values in each row that match 'fish'
df$Index <- apply(df[1:4], 1, function(x) sum(grepl('fish', x, ignore.case = TRUE)))
# Keep the rows with at least one match
df.sub <- subset(df, Index >= 1)

Output:

  vessel           type   class             status Index
1      a Fishery Vessel      NA                 NA     1
2      b             NA FISHING                 NA     1
3      c             NA      NA Engaged in Fishing     1
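Because apply() over rows first converts the data frame to a matrix and then calls the function once per row, it may be slow at around a million rows. One possible alternative, sketched here on the assumption that the searched columns are character (or coercible to character), is to test each column once and add up the results. The name `n_match` is made up for illustration.

```r
# Example data from the question (the "NA" entries are literal strings)
df <- data.frame(
  vessel = letters[1:4],
  type   = c("Fishery Vessel", "NA", "NA", "Cargo"),
  class  = c("NA", "FISHING", "NA", "CARGO"),
  status = c("NA", "NA", "Engaged in Fishing", "Underway"),
  stringsAsFactors = FALSE
)

# One grepl() call per column, then sum the logical vectors element-wise;
# the result is the per-row count of matching columns
n_match <- Reduce(`+`, lapply(df, grepl, pattern = "fish", ignore.case = TRUE))

# Keep the rows with at least one match, without adding an Index column
df.sub <- df[n_match >= 1, ]
```

This yields the same per-row counts as the Index column above, but with one pass per column instead of one function call per row.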

Upvotes: 1
