user3933907

Reputation: 45

R: Vectorized numbering of rows in a dataframe

I have a large ordered dataframe consisting of a number of related records. For each group of related records, I need to number them from 1 to the total number of related records. If I iterate over the whole dataframe, the operation takes too long.

Is there a vectorized way to do this?

For example, if I had this dataframe:

ID  Month     State
1   Apr-2014  AL
2   May-2014  AL
3   Jun-2014  AL
4   Apr-2014  MN
5   May-2014  MN
6   Apr-2014  FL
7   May-2014  FL

I'd like to end up with:

ID  Month     State  Seq
1   Apr-2014  AL     1
2   May-2014  AL     2
3   Jun-2014  AL     3
4   Apr-2014  MN     1
5   May-2014  MN     2
6   Apr-2014  FL     1
7   May-2014  FL     2

Upvotes: 0

Views: 67

Answers (2)

akrun

Reputation: 887158

Using the example dataset given under "data" at the end of this answer: if the dataset is ordered by Month, you can compare each row's Month with the previous row's and check whether they differ. In the code below, df$Month[-1] drops the first observation and df$Month[-nrow(df)] drops the last, so the two vectors have equal length and line up as (current, previous) pairs. Comparing them with != gives TRUE wherever the value changes. Prepend TRUE for the first row and take the cumsum to get the index.

 df$Seq <- cumsum(c(TRUE, df$Month[-1] != df$Month[-nrow(df)]))
 df
 #  ID    Month State Seq
 #1  1 Apr-2014    AL   1
 #2  2 Apr-2014    MN   1
 #3  3 Apr-2014    FL   1
 #4  4 May-2014    AL   2
 #5  5 May-2014    MN   2
 #6  6 May-2014    FL   2
 #7  7 Jun-2014    AL   3
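
To see why this works, the intermediate vectors can be printed one step at a time. This is only a sketch on the Month values from the example data. The names month_col, changed, is_start and seq_id are illustrative and not part of the answer.

```r
# Month values from the example data (sorted so that equal months are adjacent)
month_col <- c("Apr-2014", "Apr-2014", "Apr-2014",
               "May-2014", "May-2014", "May-2014", "Jun-2014")

# TRUE where a row's month differs from the previous row's month
changed <- month_col[-1] != month_col[-length(month_col)]
changed
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

# The first row always starts a new group
is_start <- c(TRUE, changed)

# Each TRUE bumps the running total by one, giving the group index
seq_id <- cumsum(is_start)
seq_id
# [1] 1 1 1 2 2 2 3
```

The approach relies on equal values being adjacent. If the same month appeared again later in the column, it would get a new index.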

Or you can convert the Month column to a factor whose levels follow the order of first appearance, and then convert that factor to numeric.

 as.numeric(factor(df$Month, levels=unique(df$Month)))
 #[1] 1 1 1 2 2 2 3
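
A closely related sketch uses match() against the unique values, which gives the same first-appearance index without building a factor. Unlike the row-to-row comparison, neither version needs equal months to be adjacent. The vector m below is made-up, deliberately unsorted data for illustration only.

```r
# Illustrative, unsorted month values (not taken from the question)
m <- c("May-2014", "Apr-2014", "May-2014", "Jun-2014", "Apr-2014")

# Index of each value within the unique values, in order of first appearance
via_match  <- match(m, unique(m))
via_factor <- as.numeric(factor(m, levels = unique(m)))

via_match
# [1] 1 2 1 3 2
```

Both vectors hold the same numbers; match() returns integers while the factor route returns doubles.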

Or using data.table

 library(data.table)
 DT <- setDT(df)[, Seq := .GRP, by = Month]
 DT
  #   ID    Month State Seq
  #1:  1 Apr-2014    AL   1
  #2:  2 Apr-2014    MN   1
  #3:  3 Apr-2014    FL   1
  #4:  4 May-2014    AL   2
  #5:  5 May-2014    MN   2
  #6:  6 May-2014    FL   2
  #7:  7 Jun-2014    AL   3

.GRP is a special symbol in data.table that holds a simple group counter: 1 for the first group, 2 for the second, and so on. Have a look at ?data.table to read more about it.
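
Note that a group counter numbers the groups themselves. For data laid out as in the question (sorted by State, with the counter restarting in each state), a within-group row counter is needed instead. Here is a minimal base R sketch using ave(); df_q is a hypothetical name for a copy of the question's data frame.

```r
# The question's data, sorted by State (df_q is an illustrative name)
df_q <- data.frame(
  ID    = 1:7,
  Month = c("Apr-2014", "May-2014", "Jun-2014", "Apr-2014",
            "May-2014", "Apr-2014", "May-2014"),
  State = c("AL", "AL", "AL", "MN", "MN", "FL", "FL"),
  stringsAsFactors = FALSE
)

# ave() applies seq_along within each State and returns the results in row order
df_q$Seq <- ave(seq_along(df_q$State), df_q$State, FUN = seq_along)
df_q$Seq
# [1] 1 2 3 1 2 1 2
```

With data.table, the analogous idiom would be DT[, Seq := seq_len(.N), by = State], where .N is the number of rows in the current group.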

data

 df <- data.frame(
   ID = 1:7,
   Month = c("Apr-2014", "Apr-2014", "Apr-2014", "May-2014", "May-2014", "May-2014", "Jun-2014"),
   State = c("AL", "MN", "FL", "AL", "MN", "FL", "AL"),
   stringsAsFactors = FALSE
 )

 

Upvotes: 4

thothal

Reputation: 20329

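If you don't care about the actual Seq numbers, only that each month gets a distinct one, you could simply do the following. Note that as.factor sorts its levels alphabetically, so Jun-2014 is numbered before May-2014: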

df$Seq <- as.numeric(as.factor(df$Month))
df
#   ID    Month State Seq
# 1  1 Apr-2014    AL   1
# 2  2 Apr-2014    MN   1
# 3  3 Apr-2014    FL   1
# 4  4 May-2014    AL   3
# 5  5 May-2014    MN   3
# 6  6 May-2014    FL   3
# 7  7 Jun-2014    AL   2

If you care about the actual number in Seq (i.e. that it follows calendar order), you can build a factor whose levels are in chronological order, for example:

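# Month abbreviations in calendar order, used as factor levels
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# part1: the text before the hyphen (month), as a factor in calendar order
df$part1 <- with(df, factor(gsub("([^\\-]*).*", "\\1", Month), month))
# part2: the text after the hyphen (year)
df$part2 <- with(df, factor(gsub("[^\\-]*\\-(.*)", "\\1", Month)))
# lex.order = FALSE makes month vary fastest within year, so levels are chronological
df$whole <- with(df, interaction(part1, part2, drop = TRUE, lex.order = FALSE))
df$Seq <- as.numeric(df$whole)
df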
#   ID    Month State part1 part2    whole Seq
# 1  1 Apr-2014    AL   Apr  2014 Apr.2014   1
# 2  2 Apr-2014    MN   Apr  2014 Apr.2014   1
# 3  3 Apr-2014    FL   Apr  2014 Apr.2014   1
# 4  4 May-2014    AL   May  2014 May.2014   2
# 5  5 May-2014    MN   May  2014 May.2014   2
# 6  6 May-2014    FL   May  2014 May.2014   2
# 7  7 Jun-2014    AL   Jun  2014 Jun.2014   3
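
An alternative sketch, not from the answer above, parses the strings as dates and takes each date's position among the sorted distinct dates. It assumes the Month column always has the form Mon-YYYY and that the session locale uses English month abbreviations, since %b is locale dependent. The names month_col, d and seq_id are illustrative.

```r
# Month values from the example data
month_col <- c("Apr-2014", "Apr-2014", "Apr-2014",
               "May-2014", "May-2014", "May-2014", "Jun-2014")

# Prefix a day so each string is a full date; %b matches abbreviated month names
# (assumption: English locale)
d <- as.Date(paste0("01-", month_col), format = "%d-%b-%Y")

# Position of each date among the sorted distinct dates = chronological index
seq_id <- match(d, sort(unique(d)))
seq_id
# [1] 1 1 1 2 2 2 3
```

Because the dates are real Date values, the ordering also holds across years without a separate year factor.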

Upvotes: 1
