Reputation: 4229
I would like to insert efficiently
(as I have 1mil. row df) new row if given id
occurs only once.
Here sample data and desired result:
df <- data.frame(id=c(1,1,2,6,6,6,101),year=c(2012,2013,2014,2011,2012,2013,2015),value1=rnorm(7),value2=rexp(7),value3=rpois(7,2))
df <- transform(df, rec=ave(df$id, df$id, FUN=seq_along))
dfL <- df[!duplicated(df$id, fromLast=TRUE), ]
dfL[dfL$rec==1, ][-ncol(dfL)]
The desired result for the above already selected 2 rows:
id year value1 value2 value3
2 2013 "" "" ""
2 2014 0.7345321 0.1449742 3
101 2014 "" "" ""
101 2015 0.2547620 0.7166841 1
For the whole data set it would be:
id year value1 value2 value3
1 2012 -1.38007779 2.97957329 2
1 2013 -0.03257949 0.92996719 1
2 2013 "" "" ""
2 2014 0.73453214 0.14497416 3
6 2011 -1.04242748 0.03099297 0
6 2012 0.35932001 0.90399050 3
6 2013 1.18300395 0.29446006 1
101 2014 "" "" ""
101 2015 0.25476203 0.71668415 1
Upvotes: 1
Views: 5181
Reputation: 92292
You could try something as the following
library(data.table);
setDT(df)[, if(.N < 2L) {
.SD[c(NA_real_, 1L)][, year := c(year[2L] - 1L, year[2L])]
} else .SD,
by = id]
# id year value1 value2 value3
# 1: 1 2012 -1.6552064 0.05040943 4
# 2: 1 2013 1.3478321 1.08314748 2
# 3: 2 2013 NA NA NA
# 4: 2 2014 0.2320925 0.45546268 0
# 5: 6 2011 0.5110803 0.07412734 2
# 6: 6 2012 0.5170348 0.29052181 0
# 7: 6 2013 0.6925863 0.09309213 3
# 8: 101 2014 NA NA NA
# 9: 101 2015 1.6719612 0.84413431 4
This is basically checks per group if it's larger than 1 (.N
denote the size of the group), if so it returns it as is (.SD
denotes the SubData which is the data within a specific group), if not, it replicates the whole row twice (subsets .SD
by its first value and some nonexistent NA
value) for that id and fills the missing year (I'm assigning it to column year
using the :=
operator while subsetting from the non-NA
year (year[2] - 1L
) one year an adding it with that year itself(year[2]
))
Upvotes: 2
Reputation: 887118
We could get the row index (.I
) for 'id's that have only a single element (.N==1
), subset the dataset with that index, subtract 1 from the 'year' column, assign (:=
) the 'value' columns to NA
, rbind
with the original dataset, and then order
the rows of the resulting dataset.
ind <- setDT(df)[,.I[.N==1], by = id]$V1
DT <- rbind(df,
df[ind][, list(year=year-1) ,.(id)][,names(df)[3:ncol(df)] := NA])
DT[order(id, year)]
# id year value1 value2 value3
#1: 1 2012 -0.03373792 0.28790583 0
#2: 1 2013 -0.58542756 1.02799997 2
#3: 2 2013 NA NA NA
#4: 2 2014 0.61285136 0.13438167 1
#5: 6 2011 1.51712249 0.63349530 1
#6: 6 2012 0.65738044 2.86740657 1
#7: 6 2013 -1.07418134 0.43519879 1
#8: 101 2014 NA NA NA
#9: 101 2015 -4.46956441 0.05370307 4
NOTE: It would be better to have NA
instead of ''
as assigning to ''
will also change the class
of the columns.
The rbind
approach can be also done with base R
. We get the count of 'id' using table
, subset
the original dataset based on the table output, transform
the 'year' column by subtracting 1, get the column index ('indx') of 'value' as column names (grep
), multiply those columns with NA
to change those columns to NA
, rbind
the two datasets and order
the dataset based on 'id' and 'year' column.
tbl <- table(df$id)==1
df1 <- transform(subset(df, id %in% names(tbl)[tbl]), year=year-1)
indx <- grep('value', colnames(df1))
df1[indx] <- df1[indx]*NA
df2 <- rbind(df, df1)
df2[order(df2$id, df2$year),]
Upvotes: 2