Reputation: 45
I have a large ordered dataframe consisting of a number of related records. For each group of related records, I need to number them from 1 to the total number of related records. If I iterate over the whole dataframe, the operation takes too long.
Is there a vectorized way to do this?
For example, if I had this dataframe:
ID Month State
1 Apr-2014 AL
2 May-2014 AL
3 Jun-2014 AL
4 Apr-2014 MN
5 May-2014 MN
6 Apr-2014 FL
7 May-2014 FL
I'd like to end up with:
ID Month State Seq
1 Apr-2014 AL 1
2 May-2014 AL 2
3 Jun-2014 AL 3
4 Apr-2014 MN 1
5 May-2014 MN 2
6 Apr-2014 FL 1
7 May-2014 FL 2
Upvotes: 0
Reputation: 887158
Using the example dataset given at the end of this answer: if the dataset is ordered, you can compare each row of Month with the previous row and check whether they differ. In the code below, I removed the first observation (df$Month[-1]) and compared the result with the vector that has the last observation removed (df$Month[-nrow(df)]), so that the lengths are equal. Using != gives TRUE wherever the value changes. Concatenate TRUE at the beginning and take the cumsum to get the index.
df$Seq <- cumsum(c(TRUE, df$Month[-1] != df$Month[-nrow(df)]))
df
# ID Month State Seq
#1 1 Apr-2014 AL 1
#2 2 Apr-2014 MN 1
#3 3 Apr-2014 FL 1
#4 4 May-2014 AL 2
#5 5 May-2014 MN 2
#6 6 May-2014 FL 2
#7 7 Jun-2014 AL 3
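To make the steps described above easier to follow, here is the same computation split into its intermediate vectors. This is only an illustrative sketch; the names changed, starts and seq_id are hypothetical and not part of the original answer.

```r
# Same Month values as in this answer's data (rows ordered by Month)
month <- c("Apr-2014", "Apr-2014", "Apr-2014",
           "May-2014", "May-2014", "May-2014", "Jun-2014")

# TRUE wherever a row's Month differs from the row before it
changed <- month[-1] != month[-length(month)]

# The first row always starts a new run, so prepend TRUE
starts <- c(TRUE, changed)

# Running count of run starts = index of the run each row belongs to
seq_id <- cumsum(starts)
```

Because this counts runs of equal values, a Month that reappears after a different one starts a new run, so the approach relies on the rows being sorted by Month.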
Or you can convert the Month column to a factor, with levels in order of first appearance, and convert it back to numeric.
as.numeric(factor(df$Month, levels=unique(df$Month)))
#[1] 1 1 1 2 2 2 3
Or using data.table
library(data.table)
DT <- setDT(df)[, Seq:= .GRP, by=Month]
DT
# ID Month State Seq
#1: 1 Apr-2014 AL 1
#2: 2 Apr-2014 MN 1
#3: 3 Apr-2014 FL 1
#4: 4 May-2014 AL 2
#5: 5 May-2014 MN 2
#6: 6 May-2014 FL 2
#7: 7 Jun-2014 AL 3
.GRP is a special symbol in data.table that holds a simple group counter: 1 for the first group, 2 for the second, and so on. Have a look at ?data.table to read more about it.
df <- structure(list(ID = 1:7, Month = c("Apr-2014", "Apr-2014", "Apr-2014",
"May-2014", "May-2014", "May-2014", "Jun-2014"), State = c("AL",
"MN", "FL", "AL", "MN", "FL", "AL")), .Names = c("ID", "Month",
"State"), class = "data.frame", row.names = c(NA, -7L))
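The approaches above number rows by distinct Month value, which coincides with a per-State counter in these samples because every State starts in Apr-2014 and has no gaps. If a counter that restarts for each State is what is actually needed, as the question's own example suggests, a base R sketch could look like the following. The data frame name df_q is hypothetical and rebuilds the question's layout.

```r
# The question's example: rows grouped by State
df_q <- data.frame(
  ID = 1:7,
  Month = c("Apr-2014", "May-2014", "Jun-2014",
            "Apr-2014", "May-2014", "Apr-2014", "May-2014"),
  State = c("AL", "AL", "AL", "MN", "MN", "FL", "FL"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each State group and puts the results back
# in the original row order; seq_along() numbers each group's rows 1..n
df_q$Seq <- ave(seq_len(nrow(df_q)), df_q$State, FUN = seq_along)
```

This does not depend on the Month values at all, only on the row order within each State.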
Upvotes: 4
Reputation: 20329
If you don't care which number each month gets (as.factor sorts the levels alphabetically, so Jun comes before May), you could simply do:
df$Seq <- as.numeric(as.factor(df$Month))
df
# ID Month State Seq
# 1 1 Apr-2014 AL 1
# 2 2 Apr-2014 MN 1
# 3 3 Apr-2014 FL 1
# 4 4 May-2014 AL 3
# 5 5 May-2014 MN 3
# 6 6 May-2014 FL 3
# 7 7 Jun-2014 AL 2
If you care about the actual number in Seq (i.e. that it follows calendar order), you should use something like:
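# Month abbreviations in calendar order, used as factor levels
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# part1: the text before the hyphen (month), as a factor in calendar order
df$part1 <- with(df, factor(gsub("([^\\-]*).*", "\\1", Month), month))
# part2: the text after the hyphen (year)
df$part2 <- with(df, factor(gsub("[^\\-]*\\-(.*)", "\\1", Month)))
# lex.order = FALSE makes part1 vary fastest, so levels sort by year, then month
df$whole <- with(df, interaction(part1, part2, drop = TRUE, lex.order = FALSE))
df$Seq <- as.numeric(df$whole)
df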
# ID Month State part1 part2 whole Seq
# 1 1 Apr-2014 AL Apr 2014 Apr.2014 1
# 2 2 Apr-2014 MN Apr 2014 Apr.2014 1
# 3 3 Apr-2014 FL Apr 2014 Apr.2014 1
# 4 4 May-2014 AL May 2014 May.2014 2
# 5 5 May-2014 MN May 2014 May.2014 2
# 6 6 May-2014 FL May 2014 May.2014 2
# 7 7 Jun-2014 AL Jun 2014 Jun.2014 3
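As a possible alternative to the factor and interaction route, the calendar order can also be derived from a single numeric key per month. This is a sketch only, assuming every value has the form "Mon-YYYY" with a three-letter English month abbreviation; the names month_chr, key and seq_chrono are hypothetical.

```r
month_chr <- c("Apr-2014", "Apr-2014", "Apr-2014",
               "May-2014", "May-2014", "May-2014", "Jun-2014")

# month.abb is R's built-in vector of English month abbreviations
mon <- match(substr(month_chr, 1, 3), month.abb)
yr  <- as.integer(substr(month_chr, 5, 8))

# One sortable number per calendar month
key <- yr * 12L + mon

# Dense rank: position of each key among the sorted distinct keys
seq_chrono <- match(key, sort(unique(key)))
```

Unlike the run-based cumsum approach, this gives the same number to a given month wherever it appears, so it does not require the rows to be sorted.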
Upvotes: 1