user594694

Reputation: 337

Regular expression-based list matching in R

I have two lists (more precisely, atomic character vectors) that I want to compare using regular expressions to produce a subset of one of them. I can use a 'for' loop for this, but is there simpler code? The following example illustrates my case:

# list of unique cities
city <- c('Berlin', 'Perth', 'Oslo')

# list of city-months, like 'New York-Dec'
temp <- c('Berlin-Jan', 'Delhi-Jan', 'Lima-Feb', 'Perth-Feb', 'Oslo-Jan')

# need sub-set of 'temp' for only 'Jan' month for only the items in 'city' list:
#   'Berlin-Jan', 'Oslo-Jan'

Added clarification: In the actual case that I am seeking code for, the values of the 'month' equivalent are more complex. They are fairly random alphanumeric strings in which only the first two characters carry the information I care about (they have to be '01').

Added actual case example:

# equivalent of 'city' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')

# equivalent of 'temp' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-[\d]{2}[0-9A-Z]+
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# sub-set wanted (must have '01' after the 'patient' ID part)
#   'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76'

Upvotes: 2

Views: 238

Answers (4)

Arun

Reputation: 118839

Something like this?

temp <- temp[grepl("Jan", temp)]
temp[sapply(strsplit(temp, "-"), "[[", 1) %in% city]
# [1] "Berlin-Jan" "Oslo-Jan"  

Even better, borrowing the idea from @agstudy:

temp[temp %in% paste0(city, "-Jan")]
# [1] "Berlin-Jan" "Oslo-Jan"  

Edit: How about this?

sample[gsub("(.*-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76" 

Upvotes: 4

Jesse Anderson

Reputation: 4603

Here's a solution after the others, with your new requirements:

sample[na.omit(pmatch(paste0(patient, '-01'), sample))]
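One caveat, as far as I understand pmatch(): it returns NA when a pattern partially matches more than one element, so a patient with several '01' samples would be dropped. Below is a minimal sketch of an alternative that keeps all of them. It assumes the patient ID always occupies the first 12 characters, as the pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4} in the question implies. The helper name keep_01 and the extra sample 'TCGA-43-4897-01B200' are made up for illustration.

```r
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159',
            'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# Hypothetical helper: keep samples whose first 12 characters are a known
# patient ID and whose two characters after the next hyphen are "01".
# Assumption: IDs are fixed-width (12 characters), per the question's pattern.
keep_01 <- function(samples, patients) {
  id_part   <- substr(samples, 1, 12)   # patient ID portion
  code_part <- substr(samples, 14, 15)  # two characters after the hyphen
  samples[id_part %in% patients & code_part == "01"]
}

wanted <- keep_01(sample, patient)

# A patient with two '01' samples (second one is invented for this example)
wanted_multi <- keep_01(c(sample, 'TCGA-43-4897-01B200'), patient)
```

Because the comparison is done on fixed character positions rather than by partial matching, repeated patients are not a problem; if the ID width can vary, this sketch would need a regex or strsplit() instead.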

Upvotes: 3

alexwhan

Reputation: 16026

Here's a solution with two partial string matches...

temp[agrep("Jan",temp)[which(agrep("Jan",temp) %in% sapply(city, agrep, x=temp))]]
# [1] "Berlin-Jan" "Oslo-Jan" 

As a function just for fun...

fun <- function(x, y, pattern) y[agrep(pattern, y)[which(agrep(pattern, y) %in% sapply(x, agrep, x = y))]]
# x is the vector of keys to match against (here 'city')
# y is the vector to be filtered (here 'temp')
# pattern is the quoted pattern you're filtering on

fun(city, temp, "Jan")
# [1] "Berlin-Jan" "Oslo-Jan" 

Upvotes: 1

agstudy

Reputation: 121588

You can use gsub:

x <- gsub(paste(paste(city, collapse = '-Jan|'), '-Jan', sep = ''), 1, temp)
temp[x == 1]
# [1] "Berlin-Jan" "Oslo-Jan"  

The pattern here is:

 "Berlin-Jan|Perth-Jan|Oslo-Jan"
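
The same alternation idea might carry over to the actual TCGA case by testing with grepl() directly instead of substituting with gsub(). This is only a sketch: it assumes the patient IDs contain no regex metacharacters (true for the letters, digits, and hyphens in the question), and the variable names pattern and wanted are mine.

```r
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159',
            'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# Build one anchored pattern: ^(id1|id2|...)-01
# Assumption: IDs need no escaping because they hold only [0-9A-Z-].
pattern <- paste0("^(", paste(patient, collapse = "|"), ")-01")

# Keep the samples that start with a known patient ID followed by "-01"
wanted <- sample[grepl(pattern, sample)]
```

Anchoring with ^ keeps the match at the start of the string, so a '-01' that appears later in the sample code is not picked up by accident.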

Upvotes: 2
