Numbering rows within groups in a data frame but without considering the NAs

Question

In stackoverflow there are several post asking about how to tagging specific combinations of levels among columns. However, I can't find how to do this but without considering the NAs. As an example:

df <- read.table(header=TRUE, text="
   cat        val  
1  aaa 0.05638315  
2  aaa 0.25767250  
3  aaa NA  
4  aaa 0.46854928  
5  aaa 0.55232243  
6  bbb NA  
7  bbb 0.37032054  
8  bbb 0.48377074  
9  bbb 0.54655860  
10 bbb 0.81240262  
11 ccc 0.28035384  
12 ccc 0.39848790  
13 ccc 0.62499648  
14 ccc 0.76255108  
15 ccc 0.88216552")

I would expect this:

df
   cat        val id
1  aaa 0.05638315  1
2  aaa 0.25767250  2
3  aaa         NA NA
4  aaa 0.46854928  3
5  aaa 0.55232243  4
6  bbb         NA NA
7  bbb 0.37032054  1
8  bbb 0.48377074  2
9  bbb 0.54655860  3
10 bbb 0.81240262  4
11 ccc 0.28035384  1
12 ccc 0.39848790  2
13 ccc 0.62499648  3
14 ccc 0.76255108  4
15 ccc 0.88216552  5

Possible codes are those, however, they do not contemplate NAs:

library(dplyr)
df <- df %>% group_by(cat) %>% mutate(id = row_number())

head(df)
# A tibble: 6 x 3
# Groups:   cat [2]
  cat       val    id
      
1 aaa    0.0564     1
2 aaa    0.258      2
3 aaa   NA          3
4 aaa    0.469      4
5 aaa    0.552      5
6 bbb   NA          1

library(data.table)
DT <- data.table(df)

DT[, id := seq_len(.N), by = cat]
df <- DT[, id := rowid(cat)]

head(df)
   cat        val id
1: aaa 0.05638315  1
2: aaa 0.25767250  2
3: aaa         NA  3
4: aaa 0.46854928  4
5: aaa 0.55232243  5
6: bbb         NA  1

Also, I wonder how to do if what I want is to start numbering from the first record, and from that, we numbering without considering the NAs. An example:

df2 <- read.table(header=TRUE, text="
   cat        val  
1  aaa 0.05638315  
2  aaa 0.25767250  
3  aaa NA  
4  aaa 0.46854928  
5  aaa 0.55232243  
6  bbb NA  
7  bbb 0.37032054  
8  bbb 0.48377074  
9  bbb NA  
10 bbb 0.81240262  
11 ccc 0.28035384  
12 ccc 0.39848790  
13 ccc 0.62499648  
14 ccc 0.76255108  
15 ccc 0.88216552")

I would expect that:

df2
   cat      val id
1  aaa 0.056383  1
2  aaa 0.257673  2
3  aaa       NA  3
4  aaa 0.468549  4
5  aaa 0.552322  5
6  bbb       NA NA
7  bbb 0.370321  1
8  bbb 0.483771  2
9  bbb       NA  3
10 bbb 0.812403  4
11 ccc 0.280354  1
12 ccc 0.398488  2
13 ccc 0.624996  3
14 ccc 0.762551  4
15 ccc 0.882166  5

I would appreciate any help.

Ronak Shah · Accepted Answer

You can tweak the suggestions that you have to get the desired output :

library(dplyr)

df %>%
  group_by(cat) %>%
  mutate(id = replace(row_number() - cumsum(is.na(val)), is.na(val), NA))

#   cat       val    id
#       
# 1 aaa    0.0564     1
# 2 aaa    0.258      2
# 3 aaa   NA         NA
# 4 aaa    0.469      3
# 5 aaa    0.552      4
# 6 bbb   NA         NA
# 7 bbb    0.370      1
# 8 bbb    0.484      2
# 9 bbb    0.547      3
#10 bbb    0.812      4
#11 ccc    0.280      1
#12 ccc    0.398      2
#13 ccc    0.625      3
#14 ccc    0.763      4
#15 ccc    0.882      5

This can be also be written in data.table

library(data.table)
setDT(df)[, id:= replace(seq_len(.N) - cumsum(is.na(val)), is.na(val), NA), cat]

and base R :

df$id <- with(df, replace(ave(val, cat, FUN = function(x) 
                  seq_along(x) - cumsum(is.na(x))), is.na(val), NA))

The logic is to create an id variable which is current row number in the group minus the number of NA's that have occurred till that time.

For updated df2 to consider the NA only in first row we can use :

df2 %>%
  group_by(cat) %>%
  mutate(id = replace(row_number() - is.na(first(val)),1* is.na(first(val)), NA))

#   cat       val    id
#       
# 1 aaa    0.0564     1
# 2 aaa    0.258      2
# 3 aaa   NA          3
# 4 aaa    0.469      4
# 5 aaa    0.552      5
# 6 bbb   NA         NA
# 7 bbb    0.370      1
# 8 bbb    0.484      2
# 9 bbb   NA          3
#10 bbb    0.812      4
#11 ccc    0.280      1
#12 ccc    0.398      2
#13 ccc    0.625      3
#14 ccc    0.763      4
#15 ccc    0.882      5

Numbering rows within groups in a data frame but without considering the NAs

Answers (2)

Related Questions