Reputation: 435
I have a df
like:
SampleID Chr Start End Strand Value
1: rep1 1 11001 12000 - 10
2: rep1 1 15000 20100 - 5
3: rep2 1 11070 12050 - 1
4: rep3 1 14950 20090 + 20
...
And I want to join the rows that share the same chr
and strand
and that have similar starting and end points (say like with 100 +/- distance). For those columns that the row join is performed, I would also like to concatenate the SampleID
names and the Value
. With the previous example, something like:
SampleID Chr Start End Strand Value
1:rep1,rep2 1 11001 12000 - 10,1
2: rep1 1 15000 20100 - 5
4: rep3 1 14950 20090 + 20
...
Ideas? Thanks!
EDIT:
I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?
EDIT2:
It would be also nice if just one of the variables (SampleID
or Value
) would be concatenated.
Upvotes: 1
Views: 420
Reputation: 886948
We could do group by 'Chr', 'Strand', create a grouping ID based on the difference between adjacent elements in 'Start' and 'End' columns after order
ing by 'Start', 'End', then grouped by 'Chr', 'Strand' and 'ind', get the first element of 'Start', 'End', while paste
ing the elements in 'SampleID' and 'Value' column
library(data.table)
df[order(Start, End), ind := rleid((Start - shift(Start, fill = Start[1])) < 100 &
(End - shift(End, fill = End[1])) < 100), by =.(Chr, Strand)
][, .(Start = Start[1], End = End[1],
SampleID = toString(SampleID), Value = toString(Value)) , .(Strand, Chr, ind),]
# Strand Chr ind Start End SampleID Value
#1: - 1 1 11001 12000 rep1, rep2 10, 1
#2: - 1 2 15000 20100 rep1 5
#3: + 1 1 14950 20090 rep3 20
NOTE: Assumed that 'df' is a data.table
Upvotes: 1