Enrique
Enrique

Reputation: 862

Compare two columns and check if the values of other columns have increased or decreased

I have a data.frame with multiples columns. I have one column (sequence) of unique sequences that I want to compare with the next release of this data.frame and check how many peptides they have and check if this number has been increased or decreased.

I take this data.frame from a data base, but the problem is that this data base generates new random sequence positions in each release(See 2º release).

1ºRelease
    ID  | sequence | ... | Peptides | nºproject
    1 | atggggg  | ... | 65       | project 
    2 | tgatgat  | ... | 3        | project 
    3 | actgat   | ... | 32       | project 
    4 | atgtagtt | ... | 25       | project 
    5 | ttttaaat | ... | 32       | project 



2ºrelease
    ID  | sequence | ... | Peptides | nºproject
    1 | atggggg  | ... | 66       | project 
    2 | tgatgat  | ... | 5        | project 
    3 | actgat   | ... | 36       | project 
    4 | ATTTGGGG | ... | 26       | project *** New one ***
    5 | ATTGATGA | ... | 32       | project *** New one ***
    6 | atgtagtt | ... | 47       | project 
    7 | ttttaaat | ... | 38       | project 

I would not have any problem to use duplicate function if in each release the new sequences were put at the end of the column, but unfortunately this is done randomly.

Here you have an example:

1º release:

df <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L), 
.Label = c("1", "2", "3", "4" ,"5") ), 
sequence = structure(c(1L,2L, 3L, 4L, 5L), 
.Label = c(" actgat   "," atagattg ", " atatagag ", " atggggg  ", " atgtagtt "), class = "factor"), 
peptides = structure(c(1L, 2L, 3L, 4L, 5L), 
.Label = c(" 54  ", " 84  ",  " 32  ", " 36  ", "12"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA,  -5L))

2º release:

df2 <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L), 
.Label = c("1", "2", "3", "4" ,"5" ,"6", "7" ) ), 
sequence = structure(c(1L,2L, 7L, 8L, 3L, 4L, 5L), 
.Label = c(" actgat   "," atagattg ", " atatagag ", " atggggg  ", " atgtagtt ", " gggatgac ", " TATATCC ", " TTTTAAAT "), class = "factor"), 
peptides = structure(c(1L, 2L,7L,8L, 3L, 4L, 5L), 
.Label = c(" 56  ", " 85  ",  " 31  ", " 36  ", "15", "10", "76", "98", "34", "76"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), 
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA,  -7L))

Upvotes: 0

Views: 614

Answers (3)

EDi
EDi

Reputation: 13280

@Spacedman's solution but with data.table:

library("data.table")
setDT(df, key = 'sequence')
setDT(df2, key = 'sequence')
df2[df]

Or as one-liner (possible with the newest version of data.table) :

library("data.table")
setDT(df2)[df, on="sequence"]

Upvotes: 4

Michael Griffiths
Michael Griffiths

Reputation: 1427

Since you have a key in common, you can use a join.

In tidyverse that would look like this:

library(tidyverse)

df %>% 
  full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
  # Fix data to convert to character and numeric
  mutate_each(funs(as.numeric(as.character(.))), starts_with("pept")) %>%
  # See difference
  mutate(change = peptides.2 - peptides.1)

#> Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> factors with different levels, coercing to character vector
#>   ID.1   sequence peptides.1 n_project.1 ID.2 peptides.2 n_project.2  change
#> 1    1  actgat            54    project     1         56    project       2
#> 2    2  atagattg          84    project     2         85    project       1
#> 3    3  atatagag          32    project     5         31    project      -1
#> 4    4  atggggg           36    project     6         36    project       0
#> 5    5  atgtagtt          12    project     7         15    project       3
#> 6   NA   TATATCC          NA        <NA>    3         76    project      NA
#> 7   NA  TTTTAAAT          NA        <NA>    4         98    project      NA

With full_join we see:

  1. What matches between df and df2.
  2. The new rows in df2 (which have NA values for the peptides)
  3. The change in peptides over time.

In this case I've assumed that your sequence data is case sensitive.

Base R

You can also do this in base R with merge, but I prefer the tidyverse syntax above.

merge(df, df2, by = "sequence", all = T)
#>     sequence ID.x peptides.x n_project.x ID.y peptides.y n_project.y
#> 1  actgat       1       54      project     1       56      project 
#> 2  atagattg     2       84      project     2       85      project 
#> 3  atatagag     3       32      project     5       31      project 
#> 4  atggggg      4       36      project     6       36      project 
#> 5  atgtagtt     5         12    project     7         15    project 
#> 6   TATATCC    NA       <NA>        <NA>    3         76    project 
#> 7  TTTTAAAT    NA       <NA>        <NA>    4         98    project

Upvotes: 3

Spacedman
Spacedman

Reputation: 94202

First convert your peptide counts to numeric (they're a factor with numeric character labels, that's a bit messed up):

> df$peptides=as.numeric(as.character(df$peptides))
> df2$peptides=as.numeric(as.character(df2$peptides))

A left join will match the new data onto the old ones:

> require(dplyr)
> left_join(df, df2, c("sequence"="sequence"))
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    3  atatagag          32    project     5         31    project 
4    4  atggggg           36    project     6         36    project 
5    5  atgtagtt          12    project     7         15    project 
Warning message:
In left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
  joining factors with different levels, coercing to character vector

Ignore the warning. A left join and a filter will find the ones where the peptide number has increased:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y>peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    5  atgtagtt          12    project     7         15    project 

Save that as a new data frame, or whatever.

As a check, the ones that have decreased or unchanged:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y<=peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    3  atatagag          32    project     5         31    project 
2    4  atggggg           36    project     6         36    project 

Upvotes: 4

Related Questions