barbrka
barbrka

Reputation: 153

How to define rows numbering depending on a group and a value in group's first rows?

A dataframe DD has some missing rows. Based on the values in 'ID_raw' column I have duplicated the rows in order to replace the missing rows. Now I have to number the rows in such way that the first value in each group (column 'File') equals the value in the same row in the column 'ID_raw'. This will be a key in joining the dataframe with another one. Below a dummy example of the DD dataframe:

DD<-data.frame(ID_raw=c(1,5,7,8,5,7,9,13,3,6),Val=c(1,2,8,15,54,23,88,77,32,2),File=c("A","A","A","A","B","B","B","B","C","C"))
   ID_raw Val File
1       1   1    A
2       5   2    A
3       7   8    A
4       8  15    A
5       5  54    B
6       7  23    B
7       9  88    B
8      13  77    B
9       3  32    C
10      6   2    C

So far I've successfully duplicated the rows, however, I have a problem in numbering the rows in such way, that they start from the same value as the value in ID_raw column for each group ('File').

DD$ID_diff <- 0
DD$ID_diff[1:nrow(DD)-1] <- as.integer(diff(DD$ID_raw, 1)) #values which tell how many times a row has to be duplicated
DD$ID_diff <- sapply(DD$ID_diff, function(x) ifelse(x<0, 0, x)) #replacement the values <0 (for the first rows in each 'File' group)
DD <- DD[rep(seq(nrow(DD)), DD$ID_diff), 1:ncol(DD)] #rows duplication

Based on the code above I receive this output:

    ID_raw Val File ID_diff
1        1   1    A       4
1.1      1   1    A       4
1.2      1   1    A       4
1.3      1   1    A       4
2        5   2    A       2
2.1      5   2    A       2
3        7   8    A       1
5        5  54    B       2
5.1      5  54    B       2
6        7  23    B       2
6.1      7  23    B       2
7        9  88    B       4
7.1      9  88    B       4
7.2      9  88    B       4
7.3      9  88    B       4
9        3  32    C       3
9.1      3  32    C       3
9.2      3  32    C       3

I would like to receive this:

    ID_raw Val File ID_diff ID_new
1        1   1    A       4      1
1.1      1   1    A       4      2
1.2      1   1    A       4      3
1.3      1   1    A       4      4
2        5   2    A       2      5
2.1      5   2    A       2      6
3        7   8    A       1      7
5        5  54    B       2      5
5.1      5  54    B       2      6
6        7  23    B       2      7
6.1      7  23    B       2      8
7        9  88    B       4      9
7.1      9  88    B       4      10
7.2      9  88    B       4      11
7.3      9  88    B       4      12
9        3  32    C       3      3
9.1      3  32    C       3      4
9.2      3  32    C       3      5

Upvotes: 0

Views: 63

Answers (2)

akrun
akrun

Reputation: 886948

We can do this in the chain from the beginning itself i.e. instead of creating the 'ID_diff' and using sapply, directly use diff on the 'ID_raw', then uncount, grouped by 'File', create the sequence column

library(tidyverse)
DD %>%
    mutate(ID_diff = pmax(c(diff(ID_raw), 0), 0)) %>%
    uncount(ID_diff, .remove = FALSE) %>%
    group_by(File) %>% 
    mutate(ID_new = seq(first(ID_raw), length.out = n(), by = 1))
# A tibble: 18 x 5
# Groups:   File [3]
#   ID_raw   Val File  ID_diff ID_new
#    <dbl> <dbl> <fct>   <dbl>  <dbl>
# 1      1     1 A           4      1
# 2      1     1 A           4      2
# 3      1     1 A           4      3
# 4      1     1 A           4      4
# 5      5     2 A           2      5
# 6      5     2 A           2      6
# 7      7     8 A           1      7
# 8      5    54 B           2      5
# 9      5    54 B           2      6
#10      7    23 B           2      7
#11      7    23 B           2      8
#12      9    88 B           4      9
#13      9    88 B           4     10
#14      9    88 B           4     11
#15      9    88 B           4     12
#16      3    32 C           3      3
#17      3    32 C           3      4
#18      3    32 C           3      5

Upvotes: 1

Humpelstielzchen
Humpelstielzchen

Reputation: 6441

This is one option using dplyr based on the output of your code:

df %>%
  group_by(File) %>%
  mutate(ID_new = seq(1, n()) + first(ID_raw) - 1)

# A tibble: 18 x 5
# Groups:   File [3]
   ID_raw   Val File  ID_diff ID_new
    <int> <int> <fct>   <int>  <dbl>
 1      1     1 A           4      1
 2      1     1 A           4      2
 3      1     1 A           4      3
 4      1     1 A           4      4
 5      5     2 A           2      5
 6      5     2 A           2      6
 7      7     8 A           1      7
 8      5    54 B           2      5
 9      5    54 B           2      6
10      7    23 B           2      7
11      7    23 B           2      8
12      9    88 B           4      9
13      9    88 B           4     10
14      9    88 B           4     11
15      9    88 B           4     12
16      3    32 C           3      3
17      3    32 C           3      4
18      3    32 C           3      5

Upvotes: 2

Related Questions