Javmi
Javmi

Reputation: 35

R: Identify per group the row number if value in a column lies between possible pair of values in two other columns

I have a question that might be simple but I just can't figure it out. An example of my dataframe looks like this:

> df.corrected
   Grp Grp.ind ini.1 fin.1 ini.2 fin.2 
1    A     A.1     0     5     0     5 
2    A     A.2     5    10     5    25 
3    A     A.3    10    15    NA    NA 
4    A     A.4    15    26    NA    NA 
5    A     A.5    26    28    NA    NA 
6    A     A.6    28    30    25    30 
7    B     B.1     0    15     0    10 
8    B     B.2    NA    NA    10    20 
9    B     B.3    15    20    20    25 
10   B     B.4    20    31    25    30 
11   B     B.5    31    50    30    50 

I want to look in each group for the row number in which ini.1 for each observation lies between possible ini.2 and fin.2 values (i.e. ini.2<=ini.1< fin.2, for all pairs of ini.2,fin.2). I want to know the row number to create a new grp_ind id. My desire output should look like this:

> df.corrected
   Grp Grp.ind ini.1 fin.1 ini.2 fin.2 rownum New.Grp.ind
1    A     A.1     0     5     0     5      1         A.1
2    A     A.2     5    10     5    25      2         A.2
3    A     A.3    10    15    NA    NA      2         A.2
4    A     A.4    15    26    NA    NA      2         A.2
5    A     A.5    26    28    NA    NA      6         A.6
6    A     A.6    28    30    25    30      6         A.6
7    B     B.1     0    15     0    10      1         B.1
8    B     B.2    NA    NA    10    20     NA         B.2
9    B     B.3    15    20    20    25      2         B.2
10   B     B.4    20    31    25    30      3         B.3
11   B     B.5    31    50    30    50      5         B.5

So far I tried:

df.corrected<-df %>%
  group_by(Grp) %>%
  mutate(rownum=ifelse(!(ini.1 >=ini.2 & ini.1 < fin.2),
                   NA, row_number())) %>%
  mutate(rownum=ifelse(is.na(rownum),
                   row_number(which((ini.1 >=(ini.2%in%ini.2)) & (ini.1 < (fin.2%in%fin.2)))),rownum)) %>%
  mutate(New.Grp.ind = Grp.ind[rownum])

and also taking out which() in the second mutate() and I did not have any success. Because my dataframe has over 6k observations I want a flexible solution without using na.locf. I tried that solution before and does not perform well in the whole dataset.

Does anyone have a guidance on how to tackle this problem?

I thank you all in advance for your help.

Upvotes: 1

Views: 293

Answers (1)

MKR
MKR

Reputation: 20085

The approach using dplyr is to self-join df.corrected by Grp and then apply filter to satisfy rules mentioned by OP. This will provide New.Grp.ind for matching conditions. Finally, for rows which could not meet matching rules we have join back (using right_join) with df.corrected by Grp and Grp.ind.

Note: I have assumed that Grp + Grp.ind represents a row uniquely in data. If not, then a row number should be added in data which can be using as part of right_join.

library(dplyr)

df.corrected %>% inner_join((df.corrected %>% group_by(Grp) %>%
        mutate(rownum= row_number())), by="Grp") %>%
  filter(ini.1.x >=ini.2.y & ini.1.x < fin.2.y) %>%
  select( Grp, Grp.ind = Grp.ind.x, ini.1 = ini.1.x, fin.1 = fin.1.x, ini.2 = ini.2.x, 
           fin.2 = fin.2.x, rownum, New.Grp.ind = Grp.ind.y) %>%
  right_join(df.corrected, by=c("Grp","Grp.ind")) %>%
  select( Grp, Grp.ind, ini.1 = ini.1.x, fin.1 = fin.1.x, ini.2 = ini.2.x, fin.2 = fin.2.x,
         rownum, New.Grp.ind) %>%
  mutate(New.Grp.ind = coalesce(New.Grp.ind, Grp.ind))

#    Grp Grp.ind ini.1 fin.1 ini.2 fin.2 rownum New.Grp.ind
# 1    A     A.1     0     5     0     5      1         A.1
# 2    A     A.2     5    10     5    25      2         A.2
# 3    A     A.3    10    15    NA    NA      2         A.2
# 4    A     A.4    15    26    NA    NA      2         A.2
# 5    A     A.5    26    28    NA    NA      6         A.6
# 6    A     A.6    28    30    25    30      6         A.6
# 7    B     B.1     0    15     0    10      1         B.1
# 8    B     B.2    NA    NA    NA    NA     NA         B.2
# 9    B     B.3    15    20    20    25      2         B.2
# 10   B     B.4    20    31    25    30      3         B.3
# 11   B     B.5    31    50    30    50      5         B.5

Data:

df.corrected <- read.table(text = 
"Grp Grp.ind ini.1 fin.1 ini.2 fin.2 
1    A     A.1     0     5     0     5 
2    A     A.2     5    10     5    25 
3    A     A.3    10    15    NA    NA 
4    A     A.4    15    26    NA    NA 
5    A     A.5    26    28    NA    NA 
6    A     A.6    28    30    25    30 
7    B     B.1     0    15     0    10 
8    B     B.2    NA    NA    10    20 
9    B     B.3    15    20    20    25 
10   B     B.4    20    31    25    30 
11   B     B.5    31    50    30    50",
header = TRUE, stringsAsFactors = FALSE)

Upvotes: 1

Related Questions