Reputation: 822
I have a list of observations associated with samples. I would like to remove identical observations that occur in specific pairs of samples.
Example data:
sample observation
sample1A 5
sample1B 7
sample2A 10
sample2B 10
sample3A 10
sample3B 5
So the idea would be to group samples into pairs based on the letters A and B, and then for each of these pairs remove any rows with matching observations.
In the case above, only the observations from sample2A and sample2B would be excluded, as they come from the same sample (sample2) measured on two separate occasions. The output would look like:
sample observation
sample1A 5
sample1B 7
sample3A 10
sample3B 5
If it is possible to do this using dplyr, that would be extra useful, as I am trying to improve my proficiency with it.
I imagine that using group_by() to group the data by sample name and then using filter() could work. However, I am not sure how to combine the two steps: first pairing rows based on a regular expression or string, then filtering by comparing the values between the rows of each pair.
Thanks in advance for your help.
Upvotes: 3
Views: 763
Reputation: 47340
If your format is that regular (rows ordered so that the two rows of each pair are adjacent), you can also do this:
df %>% filter(matrix(.$observation,2) %>% {.[1,]!=.[2,]} %>% rep(each=2))
With base R only, and as short as I could make it:
df[rep(!!diff(matrix(df[[2]],2)),each=2),]
# sample observation
# 1 sample1A 5
# 2 sample1B 7
# 5 sample3A 10
# 6 sample3B 5
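To unpack the one-liners above, here is a step-by-step sketch of the same idea. It assumes that every sample has exactly two rows and that the rows of each pair are adjacent; the intermediate names (m, keep_pair, keep_row, result) are made up for illustration, and df is rebuilt from the question's data so the snippet runs on its own.

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  sample = c("sample1A", "sample1B", "sample2A",
             "sample2B", "sample3A", "sample3B"),
  observation = c(5, 7, 10, 10, 10, 5),
  stringsAsFactors = FALSE
)

# Fill a 2-row matrix column by column: one column per pair,
# row 1 holds the A value and row 2 the B value
m <- matrix(df$observation, nrow = 2)

# TRUE for pairs whose two observations differ
keep_pair <- m[1, ] != m[2, ]

# Expand back to one flag per original row
keep_row <- rep(keep_pair, each = 2)

result <- df[keep_row, ]
result
```

If the rows were not sorted by pair, or a sample had more or fewer than two rows, this positional approach would silently compare the wrong rows, so a grouping-based approach is safer for irregular data.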
Upvotes: 1
Reputation: 887691
We can create a grouping variable by removing the last character of 'sample' and then filter on the number of distinct 'observation' values in each group, i.e. if that number is greater than 1, we keep the group.
library(dplyr)
df2 %>%
  group_by(grp = sub("[A-Z]$", "", sample)) %>%
  filter(n_distinct(observation) > 1) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 2
# sample observation
# <chr> <int>
#1 sample1A 5
#2 sample1B 7
#3 sample3A 10
#4 sample3B 5
# data used above
df2 <- structure(list(sample = c("sample1A", "sample1B", "sample2A",
    "sample2B", "sample3A", "sample3B"), observation = c(5L, 7L,
    10L, 10L, 10L, 5L)), .Names = c("sample", "observation"),
    class = "data.frame", row.names = c(NA, -6L))
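To make the two steps of this answer easier to follow, the sketch below shows the intermediate values in base R: the grouping key produced by sub() and the number of distinct observations per key. The variable names (sample_names, distinct_per_grp) are only illustrative, and tapply() stands in here for what group_by() plus n_distinct() compute in the dplyr pipeline.

```r
# Sample names and observations from the question
sample_names <- c("sample1A", "sample1B", "sample2A",
                  "sample2B", "sample3A", "sample3B")
observation <- c(5L, 7L, 10L, 10L, 10L, 5L)

# Step 1: drop the trailing capital letter to get the pair key
grp <- sub("[A-Z]$", "", sample_names)
grp
# "sample1" "sample1" "sample2" "sample2" "sample3" "sample3"

# Step 2: count distinct observations within each pair
distinct_per_grp <- tapply(observation, grp, function(v) length(unique(v)))
distinct_per_grp
# sample1 sample2 sample3
#       2       1       2

# Pairs with more than one distinct value are the ones that are kept
names(distinct_per_grp)[distinct_per_grp > 1]
# "sample1" "sample3"
```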
Upvotes: 5
Reputation: 1664
A solution in base R with a loop.
# create data as a named vector
dat <- c(5, 7, 10, 10, 10, 5)
names(dat) <- c('sample1A', 'sample1B', 'sample2A', 'sample2B', 'sample3A', 'sample3B')
dat
# pair key: the sample name without its last character
pairs <- substr(names(dat), 1, nchar(names(dat)) - 1)
single <- unique(pairs)
new_dat <- NULL
# keep a pair only if its values contain no duplicates
for (i in seq_along(single)) {
  pos <- pairs == single[i]
  if (!any(duplicated(dat[pos]))) {
    new_dat <- c(new_dat, dat[pos])
  }
}
new_dat
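The same check can probably be written without an explicit loop. The following is a sketch of one possible variant using tapply(), not part of the original answer; dup_by_pair is a hypothetical helper name, and the data is recreated so the snippet runs on its own.

```r
# Same named vector as in the loop version
dat <- c(5, 7, 10, 10, 10, 5)
names(dat) <- c('sample1A', 'sample1B', 'sample2A',
                'sample2B', 'sample3A', 'sample3B')

# Pair key: the sample name without its last character
pairs <- substr(names(dat), 1, nchar(names(dat)) - 1)

# For each pair, TRUE if any value occurs more than once
dup_by_pair <- tapply(dat, pairs, function(v) anyDuplicated(v) > 0)

# Look up the flag for every element and keep the non-duplicated pairs
new_dat <- dat[!unname(dup_by_pair[pairs])]
new_dat
# sample1A sample1B sample3A sample3B
#        5        7       10        5
```

Unlike the matrix-based one-liners, this variant does not depend on row order, since each element is matched to its pair by name.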
Upvotes: 1