Tato14
Tato14

Reputation: 435

Join rows in a data frame which have similar (but not equal) values

I have a df like:

   SampleID Chr Start End    Strand  Value
1:   rep1     1 11001 12000     -     10
2:   rep1     1 15000 20100     -     5
3:   rep2     1 11070 12050     -     1
4:   rep3     1 14950 20090     +     20
...

And I want to join the rows that share the same chr and strand and that have similar starting and end points (say like with 100 +/- distance). For those columns that the row join is performed, I would also like to concatenate the SampleID names and the Value. With the previous example, something like:

   SampleID Chr Start End    Strand  Value
1:rep1,rep2   1 11001 12000     -     10,1
2:   rep1     1 15000 20100     -     5
4:   rep3     1 14950 20090     +     20
...

Ideas? Thanks!

EDIT:

I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would be also nice if just one of the variables (SampleID or Value) would be concatenated.

Upvotes: 1

Views: 420

Answers (1)

akrun
akrun

Reputation: 886948

We could do group by 'Chr', 'Strand', create a grouping ID based on the difference between adjacent elements in 'Start' and 'End' columns after ordering by 'Start', 'End', then grouped by 'Chr', 'Strand' and 'ind', get the first element of 'Start', 'End', while pasteing the elements in 'SampleID' and 'Value' column

library(data.table)
df[order(Start, End), ind := rleid((Start - shift(Start, fill = Start[1])) < 100 & 
     (End -  shift(End, fill = End[1])) < 100), by =.(Chr, Strand)
   ][, .(Start = Start[1], End = End[1], 
     SampleID = toString(SampleID), Value = toString(Value)) , .(Strand, Chr, ind),]
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20

NOTE: Assumed that 'df' is a data.table

Upvotes: 1

Related Questions