Reputation: 822

Filter out observations present in specific pairs of samples in R

I have a list of observations associated with samples. I would like to remove identical observations that occur in specific pairs of samples.

Example data:

sample observation
sample1A 5
sample1B 7
sample2A 10
sample2B 10
sample3A 10
sample3B 5

So the idea would be to group samples into pairs based on the letters A and B, and then for each of these pairs remove any rows with matching observations.

In the case above, only the observations from sample2A and sample2B would be excluded, because they come from the same sample (sample2) measured on two separate occasions and have the same value. The output would look like:

sample observation
sample1A 5
sample1B 7
sample3A 10
sample3B 5

If it is possible to do this using dplyr, that would be extra useful, as I am trying to improve my proficiency with it.

I imagine that using group_by() to group the data by sample name and then using filter() could work. What I am not sure about is how to combine the two steps: first pairing rows based on a regular expression or substring of the sample name, then filtering by comparing the observation values of the rows within each pair.

Thanks in advance for your help.

Upvotes: 3

Answers (3)

moodymudskipper

Reputation: 47340

If your format is that regular, with each pair occupying two consecutive rows, you can also do this:

df %>% filter(matrix(.$observation,2) %>% {.[1,]!=.[2,]} %>% rep(each=2))

With only base R, and as short as I could make it:

df[rep(!!diff(matrix(df[[2]],2)),each=2),]

#     sample observation
# 1 sample1A           5
# 2 sample1B           7
# 5 sample3A          10
# 6 sample3B           5
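
The base one-liner is dense, so here is a sketch that unpacks it into named steps. The intermediate names (`m`, `d`, `keep_pair`, `keep_row`) are mine, not part of the answer, and the data frame is rebuilt from the question's example on the assumption that the two rows of each pair are adjacent.

```r
# Example data from the question (the name `df` is assumed).
df <- data.frame(
  sample = c("sample1A", "sample1B", "sample2A",
             "sample2B", "sample3A", "sample3B"),
  observation = c(5L, 7L, 10L, 10L, 10L, 5L),
  stringsAsFactors = FALSE
)

# Fill a 2-row matrix column by column: one column per pair.
m <- matrix(df$observation, nrow = 2)

# diff() on a matrix differences consecutive rows: row 2 minus row 1.
d <- diff(m)

# A non-zero difference means the two observations in the pair differ.
keep_pair <- as.vector(d != 0)

# Repeat each pair-level flag twice so it lines up with the rows of df.
keep_row <- rep(keep_pair, each = 2)

result <- df[keep_row, ]
result
```

In the one-liner, `!!` does the job of `d != 0` here: negating a number twice turns any non-zero value into TRUE and zero into FALSE.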

Upvotes: 1

akrun

Reputation: 887691

We can create a grouping variable by removing the trailing letter from 'sample', and then filter on the number of distinct 'observation' values within each group: if a group has more than one distinct value, we keep its rows.

library(dplyr)
df2 %>%
  group_by(grp = sub("[A-Z]$", "", sample)) %>%
  filter(n_distinct(observation)>1) %>% 
  ungroup() %>% 
  select(-grp)
# A tibble: 4 x 2
#    sample observation
#      <chr>       <int>
#1 sample1A           5
#2 sample1B           7
#3 sample3A          10
#4 sample3B           5

data

df2 <- structure(list(sample = c("sample1A", "sample1B", "sample2A", 
"sample2B", "sample3A", "sample3B"), observation = c(5L, 7L, 
10L, 10L, 10L, 5L)), .Names = c("sample", "observation"),
 class = "data.frame", row.names = c(NA, -6L))
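
To make the grouping step visible, here is a rough base R sketch of the same idea: build the key with `sub()`, count distinct values per key, and keep the rows whose count is above 1. It is only an illustration of the logic, not a replacement for the dplyr pipeline; the names `grp`, `n_unique` and `result` are assumptions introduced for this example.

```r
# Example data from the question.
df2 <- data.frame(
  sample = c("sample1A", "sample1B", "sample2A",
             "sample2B", "sample3A", "sample3B"),
  observation = c(5L, 7L, 10L, 10L, 10L, 5L),
  stringsAsFactors = FALSE
)

# Drop one trailing capital letter: "sample1A" and "sample1B" become "sample1".
grp <- sub("[A-Z]$", "", df2$sample)

# For every row, the number of distinct observations within its group.
n_unique <- ave(df2$observation, grp, FUN = function(x) length(unique(x)))

# Keep only rows that belong to a group with more than one distinct value.
result <- df2[n_unique > 1, ]
result
```

Note that the pattern `[A-Z]$` only strips a single uppercase letter at the end of the name, so sample names with a different suffix scheme would need a different pattern.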

Upvotes: 5

and-bri

Reputation: 1664

A solution in base R with a loop.

# create data
dat <- c(5,7,10,10,10,5)
names(dat) <- c('sample1A', 'sample1B', 'sample2A', 'sample2B', 'sample3A', 'sample3B')
dat

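# strip the trailing A/B to get the pair each sample belongs to
pairs <- substr(names(dat), 1, nchar(names(dat))-1)
single <- unique(pairs)

new_dat <- NULL
for(i in seq_along(single)){
  # positions of the samples in the current pair
  pos <- pairs == single[i]
  # keep the pair only if its observations are not identical
  if(!any(duplicated(dat[pos]))){
    new_dat <- c(new_dat, dat[pos])
  }
}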

new_dat
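
The loop returns a named numeric vector rather than the two-column table shown in the question. As a sketch, assuming the same named vector as input, the per-pair check can also be done with `tapply()` and the result turned back into a data frame. The names `pair_ok`, `kept` and `out` are made up for this example.

```r
# Named vector as in the answer above.
dat <- c(sample1A = 5, sample1B = 7, sample2A = 10,
         sample2B = 10, sample3A = 10, sample3B = 5)

# Pair label: the sample name without its last character.
pairs <- substr(names(dat), 1, nchar(names(dat)) - 1)

# One logical per pair: TRUE when no value is repeated within the pair.
pair_ok <- tapply(dat, pairs, function(x) !any(duplicated(x)))

# Look up each element's pair by name and keep the elements of good pairs.
kept <- dat[as.vector(pair_ok[pairs])]

# Rebuild the sample/observation layout used in the question.
out <- data.frame(sample = names(kept),
                  observation = unname(kept),
                  stringsAsFactors = FALSE)
out
```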

Upvotes: 1
