Gaius Augustus

Reputation: 970

R - Changing values based on reference

I have a dataframe that looks like this toy example:

Ind     ID    RegionStart     RegionEnd    Value     TN
1       A       1              100           3       N
1       A       101            200           2       N
2       A       1              100           3       T
2       A       101            200           2       T
3       B       1              100           3       N
3       B       101            200           2       N
4       B       1              100           5       T
4       B       101            200           5       T

I have 4 individuals, which are actually 2 pairs (a reference, N, and a subject, T). For simplicity, there are only 2 pairs, and only 2 regions. In my real file, there are >500 pairs and >60,000 regions. The regions all have the same start and end, so there are no overlaps.

What I want to do is match individuals based on ID + region, and if (1) the N individual's Value is not 3 and (2) the N and T Values are the same, then change the corresponding Value in both the N and T individuals to 3.

The resulting table from above would be:

Ind     ID    RegionStart     RegionEnd    Value     TN
1       A       1              100           3       N
1       A       101            200           3       N
2       A       1              100           3       T
2       A       101            200           3       T
3       B       1              100           3       N
3       B       101            200           2       N
4       B       1              100           5       T
4       B       101            200           5       T

Note that for ID=B, region 1-100 did not change because N's Value is already 3, and region 101-200 did not change because the Values for N and T are not the same.

I thought of using dplyr to group the matches, for example:

df <- df %>% arrange(ID, Ind, RegionStart, TN) %>% group_by(ID)

Or perhaps I could use data.table with ID set as the key? Either way, I'm not sure how to compare rows easily. I'm still pretty new to dplyr and data.table, so a short explanation of the commands would be great if you use them. Feel free to use another package. The data is pretty large, so the solution needs to be efficient.

Upvotes: 1

Views: 122

Answers (1)

Frank

Reputation: 66819

With data.table:

library(data.table)
setDT(DF)

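# For each ID + region group (one N row and one T row):
DF[, Value := {
  # TRUE when N's Value is not 3 and N and T share the same Value
  fixit = ( Value[TN=="N"] != 3L ) && ( uniqueN(Value) == 1L )
  # overwrite both rows with 3, otherwise keep the Values as they are
  if (fixit) 3L else Value
}, by=.(ID, RegionStart)]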

Note that this will change your original data set (rather than simply returning an altered table).
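If you would rather keep the original table untouched, one option is to run the same update on a copy. Here is a minimal sketch using the toy data from the question. The name DF2 is only for illustration, and it assumes Value is stored as an integer column and that every ID + region group has exactly one N row.

```r
library(data.table)

# Toy data from the question
DF <- data.table(
  Ind         = rep(1:4, each = 2),
  ID          = rep(c("A", "B"), each = 4),
  RegionStart = rep(c(1L, 101L), times = 4),
  RegionEnd   = rep(c(100L, 200L), times = 4),
  Value       = c(3L, 2L, 3L, 2L, 3L, 2L, 5L, 5L),
  TN          = rep(c("N", "T", "N", "T"), each = 2)
)

# copy() makes a deep copy, so := below modifies DF2 and leaves DF alone
DF2 <- copy(DF)

DF2[, Value := {
  fixit <- (Value[TN == "N"] != 3L) && (uniqueN(Value) == 1L)
  if (fixit) 3L else Value
}, by = .(ID, RegionStart)]

DF2$Value  # 3 3 3 3 3 2 5 5
DF$Value   # 3 2 3 2 3 2 5 5 (unchanged)
```

Only ID=A, region 101-200 changes, which matches the expected table in the question.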


With dplyr:

library(dplyr)
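# Returns a new grouped table; assign the result if you want to keep it
DF %>% group_by(ID, RegionStart) %>%
  mutate(Value = {
    # TRUE when N's Value is not 3 and N and T share the same Value
    fixit = ( Value[TN=="N"] != 3L ) && ( n_distinct(Value) == 1L )
    if (fixit) 3L else Value
  })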

How it works: within each ID + RegionStart group, Value[TN=="N"] picks out the reference's Value. uniqueN and n_distinct count the number of distinct values in a vector, so if both elements of Value in a group are the same, they return 1L.
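To see the two pieces of the condition in isolation, here is a small sketch on hand-made vectors standing in for a single group. The variable names are made up for illustration, and both packages are assumed to be installed.

```r
# One group where N and T agree, and one where they differ
values_same  <- c(2L, 2L)
values_mixed <- c(2L, 5L)

data.table::uniqueN(values_same)    # 1
dplyr::n_distinct(values_same)      # 1
data.table::uniqueN(values_mixed)   # 2
dplyr::n_distinct(values_mixed)     # 2

# The full check for a single group, e.g. ID=A, region 101-200
Value <- c(2L, 2L)
TN    <- c("N", "T")
fixit <- (Value[TN == "N"] != 3L) && (data.table::uniqueN(Value) == 1L)
fixit  # TRUE, so both rows would be set to 3
```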

Upvotes: 4
