Reputation: 862
I have a data.frame with multiples columns. I have one column (sequence) of unique sequences that I want to compare with the next release of this data.frame and check how many peptides they have and check if this number has been increased or decreased.
I take this data.frame from a data base, but the problem is that this data base generates new random sequence positions in each release(See 2º release).
1ºRelease
ID | sequence | ... | Peptides | nºproject
1 | atggggg | ... | 65 | project
2 | tgatgat | ... | 3 | project
3 | actgat | ... | 32 | project
4 | atgtagtt | ... | 25 | project
5 | ttttaaat | ... | 32 | project
2ºrelease
ID | sequence | ... | Peptides | nºproject
1 | atggggg | ... | 66 | project
2 | tgatgat | ... | 5 | project
3 | actgat | ... | 36 | project
4 | ATTTGGGG | ... | 26 | project *** New one ***
5 | ATTGATGA | ... | 32 | project *** New one ***
6 | atgtagtt | ... | 47 | project
7 | ttttaaat | ... | 38 | project
I would not have any problem to use duplicate function if in each release the new sequences were put at the end of the column, but unfortunately this is done randomly.
Here you have an example:
1º release:
df <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4" ,"5") ),
sequence = structure(c(1L,2L, 3L, 4L, 5L),
.Label = c(" actgat "," atagattg ", " atatagag ", " atggggg ", " atgtagtt "), class = "factor"),
peptides = structure(c(1L, 2L, 3L, 4L, 5L),
.Label = c(" 54 ", " 84 ", " 32 ", " 36 ", "12"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA, -5L))
2º release:
df2 <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L),
.Label = c("1", "2", "3", "4" ,"5" ,"6", "7" ) ),
sequence = structure(c(1L,2L, 7L, 8L, 3L, 4L, 5L),
.Label = c(" actgat "," atagattg ", " atatagag ", " atggggg ", " atgtagtt ", " gggatgac ", " TATATCC ", " TTTTAAAT "), class = "factor"),
peptides = structure(c(1L, 2L,7L,8L, 3L, 4L, 5L),
.Label = c(" 56 ", " 85 ", " 31 ", " 36 ", "15", "10", "76", "98", "34", "76"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 0
Views: 614
Reputation: 13280
@Spacedman's solution but with data.table
:
library("data.table")
setDT(df, key = 'sequence')
setDT(df2, key = 'sequence')
df2[df]
Or as one-liner (possible with the newest version of data.table) :
library("data.table")
setDT(df2)[df, on="sequence"]
Upvotes: 4
Reputation: 1427
Since you have a key in common, you can use a join
.
In tidyverse
that would look like this:
library(tidyverse)
df %>%
full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
# Fix data to convert to character and numeric
mutate_each(funs(as.numeric(as.character(.))), starts_with("pept")) %>%
# See difference
mutate(change = peptides.2 - peptides.1)
#> Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> factors with different levels, coercing to character vector
#> ID.1 sequence peptides.1 n_project.1 ID.2 peptides.2 n_project.2 change
#> 1 1 actgat 54 project 1 56 project 2
#> 2 2 atagattg 84 project 2 85 project 1
#> 3 3 atatagag 32 project 5 31 project -1
#> 4 4 atggggg 36 project 6 36 project 0
#> 5 5 atgtagtt 12 project 7 15 project 3
#> 6 NA TATATCC NA <NA> 3 76 project NA
#> 7 NA TTTTAAAT NA <NA> 4 98 project NA
With full_join
we see:
df
and df2
.df2
(which have NA
values for the peptides)In this case I've assumed that your sequence
data is case sensitive.
You can also do this in base R with merge
, but I prefer the tidyverse
syntax above.
merge(df, df2, by = "sequence", all = T)
#> sequence ID.x peptides.x n_project.x ID.y peptides.y n_project.y
#> 1 actgat 1 54 project 1 56 project
#> 2 atagattg 2 84 project 2 85 project
#> 3 atatagag 3 32 project 5 31 project
#> 4 atggggg 4 36 project 6 36 project
#> 5 atgtagtt 5 12 project 7 15 project
#> 6 TATATCC NA <NA> <NA> 3 76 project
#> 7 TTTTAAAT NA <NA> <NA> 4 98 project
Upvotes: 3
Reputation: 94202
First convert your peptide counts to numeric (they're a factor with numeric character labels, that's a bit messed up):
> df$peptides=as.numeric(as.character(df$peptides))
> df2$peptides=as.numeric(as.character(df2$peptides))
A left join will match the new data onto the old ones:
> require(dplyr)
> left_join(df, df2, c("sequence"="sequence"))
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 1 actgat 54 project 1 56 project
2 2 atagattg 84 project 2 85 project
3 3 atatagag 32 project 5 31 project
4 4 atggggg 36 project 6 36 project
5 5 atgtagtt 12 project 7 15 project
Warning message:
In left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
joining factors with different levels, coercing to character vector
Ignore the warning. A left join and a filter will find the ones where the peptide number has increased:
> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y>peptides.x)
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 1 actgat 54 project 1 56 project
2 2 atagattg 84 project 2 85 project
3 5 atgtagtt 12 project 7 15 project
Save that as a new data frame, or whatever.
As a check, the ones that have decreased or unchanged:
> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y<=peptides.x)
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 3 atatagag 32 project 5 31 project
2 4 atggggg 36 project 6 36 project
Upvotes: 4