Reputation: 970
I have a dataframe that looks like this toy example:
Ind ID RegionStart RegionEnd Value TN
1 A 1 100 3 N
1 A 101 200 2 N
2 A 1 100 3 T
2 A 101 200 2 T
3 B 1 100 3 N
3 B 101 200 2 N
4 B 1 100 5 T
4 B 101 200 5 T
I have 4 individuals, which are actually 2 pairs (a reference, N, and a subject, T). For simplicity, the example has only 2 pairs and only 2 regions. In my real file, there are >500 pairs and >60,000 regions. Every individual has the same region boundaries (same start and end), so regions never overlap.
What I want to do is match individuals based on ID + region. If the Value of the N-individual at that region is != 3 (not equal to 3), and the Value for the N-individual and the T-individual match in that region (e.g. N-ind = 2 & T-ind = 2), then change the corresponding Value in both the N and T individuals to 3.
The resulting table from above would be:
Ind ID RegionStart RegionEnd Value TN
1 A 1 100 3 N
1 A 101 200 3 N
2 A 1 100 3 T
2 A 101 200 3 T
3 B 1 100 3 N
3 B 101 200 2 N
4 B 1 100 5 T
4 B 101 200 5 T
Note that for ID=B, region 1-100 did not change Values because N's Value = 3; region 101-200 did not change because the Values for N & T were not the same.
I thought of using dplyr to group the matches, for example:
df <- df %>% arrange(ID, Ind, RegionStart, TN) %>% group_by(ID)
Or perhaps using data.table and setting ID as the key? But I'm still not sure how to easily compare rows. I'm still pretty new to dplyr & data.table, so a short explanation of the command would be great if you use these. Feel free to use another package, though. The data is pretty large, so the solution needs to be efficient.
Upvotes: 1
Views: 122
Reputation: 66819
With data.table:
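library(data.table)
setDT(DF)  # convert DF to a data.table by reference
# for each ID + region group: set Value to 3 if N is not 3 and N and T agree
DF[, Value := {
  fixit = ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )
  if (fixit) 3L else Value
}, by=.(ID, RegionStart)]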
Note that this will change your original data set (rather than simply returning an altered table).
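If you want to keep the original table, one option is to work on a deep copy made with data.table's copy(). The sketch below rebuilds the toy data from the question (the column types are an assumption, with Value stored as integer) and applies the same update to the copy; the name fixed is only illustrative.

```r
library(data.table)

# Toy data from the question; integer Value is an assumption about the real file
DF <- data.table(
  Ind         = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  ID          = c("A", "A", "A", "A", "B", "B", "B", "B"),
  RegionStart = rep(c(1L, 101L), times = 4),
  RegionEnd   = rep(c(100L, 200L), times = 4),
  Value       = c(3L, 2L, 3L, 2L, 3L, 2L, 5L, 5L),
  TN          = c("N", "N", "T", "T", "N", "N", "T", "T")
)

# copy() makes a deep copy, so the := below leaves DF untouched
fixed <- copy(DF)
fixed[, Value := {
  fixit <- (Value[TN == "N"] != 3L) & (uniqueN(Value) == 1L)
  if (fixit) 3L else Value
}, by = .(ID, RegionStart)]

print(fixed)  # only ID A, region 101-200 is changed to 3
```

Here DF still holds the original values, and fixed holds the corrected ones.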
With dplyr:
library(dplyr)
# same per-group test as above; this returns a new (grouped) table and leaves DF unchanged
DF %>% group_by(ID, RegionStart) %>%
  mutate(Value = {
    fixit = ( Value[TN=="N"] != 3L ) & ( n_distinct(Value) == 1L )
    if (fixit) 3L else Value
  })
How it works: uniqueN and n_distinct count the number of distinct values in a vector. Each ID + region group is assumed to hold exactly one N row and one T row, so if both elements of Value are the same, the count is 1L.
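To see the test in isolation, here is a small sketch on hand-made vectors. The helper needs_fix is hypothetical (it is not part of data.table or dplyr) and just wraps the condition used above, assuming one N value and one T value per group.

```r
library(data.table)  # uniqueN()
library(dplyr)       # n_distinct()

# Hypothetical helper: the per-group test from the answer,
# assuming exactly one N row and one T row per group
needs_fix <- function(value, tn) {
  (value[tn == "N"] != 3L) & (uniqueN(value) == 1L)
}

tn     <- c("N", "T")
agree  <- c(2L, 2L)  # N and T match, and N is not 3
differ <- c(2L, 5L)  # N and T differ
at_ref <- c(3L, 3L)  # N and T match, but N is already 3

uniqueN(agree)      # 1
n_distinct(agree)   # 1
uniqueN(differ)     # 2

needs_fix(agree, tn)   # TRUE
needs_fix(differ, tn)  # FALSE
needs_fix(at_ref, tn)  # FALSE
```

Only the first case meets both conditions, which matches the single changed region (ID A, 101-200) in the expected output.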
Upvotes: 4