Carrol

Reputation: 1285

R: Replacing data by group in dataframe

I have a dataset of this style:

id1  id2  start_line end_line content   
A    B    1          1        "aaaa" 
A    B    4          4        "aa mm" 
A    B    5          5        "boool"
A    B    6          6        "omw"   
C    D    6          6        "hear!" 
C    D    7          7        " me out!"
C    D    21         21       "hello"   
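
For anyone who wants to reproduce this, the data above could be built roughly as follows. This is only a sketch: the name `df` and the integer type of the line columns are assumptions, not something fixed by my real dataset.

```r
# Reproducible version of the example data.
# Assumptions: the data frame is called `df` and the line columns are integers.
df <- data.frame(
  id1        = c("A", "A", "A", "A", "C", "C", "C"),
  id2        = c("B", "B", "B", "B", "D", "D", "D"),
  start_line = c(1L, 4L, 5L, 6L, 6L, 7L, 21L),
  end_line   = c(1L, 4L, 5L, 6L, 6L, 7L, 21L),
  content    = c("aaaa", "aa mm", "boool", "omw", "hear!", " me out!", "hello"),
  stringsAsFactors = FALSE
)
```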

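I need to mutate this several times, with specific criteria. In particular, rows that have the same id1, the same id2 and consecutive start_line values should be treated as one block: every row in the block gets the block's first start_line and last end_line, the original line number is kept in real_line, and each block gets its own ID (cid).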

So, the expected result would be:

id1  id2  start_line end_line content      real_line   cid
A    B    1          1        "aaaa"        1          1
A    B    4          6        "aa mm"       4          2
A    B    4          6        "boool"       5          2
A    B    4          6        "omw"         6          2
C    D    6          7        "hear!"       6          3
C    D    6          7        " me out!"    7          3
C    D    21         21       "hello"       21         4

I can add real_line by simply copying the original column, but I don't know how to replace start_line and end_line without summarising.

Upvotes: 1

Views: 827

Answers (2)

akrun

Reputation: 887891

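We group by 'id1', 'id2', then add a second grouping ('grp') based on the cumulative sum of a logical vector that is TRUE wherever the difference between adjacent 'start_line' values is not 1. Within each group, 'start_line' and 'end_line' are replaced by the first and last values, and 'cid' is the group index.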

library(dplyr)
df %>% 
    group_by(id1, id2) %>% 
    group_by(grp = cumsum(c(TRUE, diff(start_line) != 1)), 
             .add = TRUE) %>% 
    mutate(real_line = start_line, 
           start_line = first(start_line), 
           end_line = last(end_line)) %>%
    mutate(cid = cur_group_id()) %>%
    ungroup() %>%
    select(-grp)

-output

# A tibble: 7 x 7
#  id1   id2   start_line end_line content      cid real_line
#  <chr> <chr>      <int>    <int> <chr>      <int>     <int>
#1 A     B              1        1 "aaaa"         1         1
#2 A     B              4        6 "aa mm"        2         4
#3 A     B              4        6 "boool"        2         5
#4 A     B              4        6 "omw"          2         6
#5 C     D              6        7 "hear!"        3         6
#6 C     D              6        7 " me out!"     3         7
#7 C     D             21       21 "hello"        4        21
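
To see how the grouping key is built, here is a small base R sketch that applies the same `cumsum(c(TRUE, diff(x) != 1))` idiom to the 'start_line' values of the A/B rows only. The variable names `steps`, `new_run` and `grp` are just for illustration.

```r
# 'start_line' values of the rows with id1 == "A", id2 == "B"
start_line <- c(1, 4, 5, 6)

# Difference between each value and the previous one: 3 1 1
steps <- diff(start_line)

# A new run starts at the first row and wherever the step is not exactly 1
new_run <- c(TRUE, steps != 1)

# Cumulative sum turns the run starts into a run index: 1 2 2 2
grp <- cumsum(new_run)
grp
```

Because the outer `group_by(id1, id2)` is kept via `.add = TRUE`, this index restarts for every id1/id2 pair, and `cur_group_id()` then numbers the combined groups.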

Upvotes: 1

Carrol

Reputation: 1285

Okay, the problem was that I wasn't ungrouping.

So, based on the question "R - Concatenate cell in dataframe, by group, depending on another cell value", I did:

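library(dplyr)

# keep the original line number before start_line is overwritten
mydf$real_line <- mydf$start_line

# assign the result back, otherwise mydf stays unchanged
mydf <- mydf %>% 
      group_by(id1, id2, grp = cumsum(c(TRUE, diff(start_line) > 1))) %>% 
      mutate(start_line = first(start_line), end_line = last(end_line)) %>%
      ungroup()

# drop the helper grouping column
mydf$grp <- NULL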

And this generated the result I needed, but without the ID per group.
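
For the missing ID, one possible addition (a sketch, not something I ran on my full dataset) is to call `cur_group_id()` inside the same grouped `mutate()`, which needs dplyr 1.0.0 or later. The `content` column is left out here for brevity, and `result` is just an illustrative name. Note that `grp` is computed over the whole data frame in this version, so it assumes the rows are already sorted by id1, id2 and start_line.

```r
library(dplyr)

# Example data from the question (content column omitted for brevity)
mydf <- data.frame(
  id1        = c("A", "A", "A", "A", "C", "C", "C"),
  id2        = c("B", "B", "B", "B", "D", "D", "D"),
  start_line = c(1L, 4L, 5L, 6L, 6L, 7L, 21L),
  end_line   = c(1L, 4L, 5L, 6L, 6L, 7L, 21L)
)

result <- mydf %>%
  # keep the original line number before start_line is overwritten
  mutate(real_line = start_line) %>%
  # a new block starts wherever start_line jumps by more than 1
  group_by(id1, id2, grp = cumsum(c(TRUE, diff(start_line) > 1))) %>%
  mutate(start_line = first(start_line),
         end_line   = last(end_line),
         cid        = cur_group_id()) %>%  # one ID per id1/id2/grp group
  ungroup() %>%
  select(-grp)

result
```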

Upvotes: 1
