Maximilian
Maximilian

Reputation: 4229

Insert new row based on condition

I would like to insert efficiently (as I have 1mil. row df) new row if given id occurs only once.

Here sample data and desired result:

df <- data.frame(id=c(1,1,2,6,6,6,101),year=c(2012,2013,2014,2011,2012,2013,2015),value1=rnorm(7),value2=rexp(7),value3=rpois(7,2))


df <- transform(df, rec=ave(df$id, df$id, FUN=seq_along))

dfL <- df[!duplicated(df$id, fromLast=TRUE), ]
dfL[dfL$rec==1, ][-ncol(dfL)]

The desired result for the above already selected 2 rows:

id year    value1    value2 value3
2  2013      ""        ""     ""
2  2014   0.7345321 0.1449742  3
101 2014   ""          ""     ""
101 2015  0.2547620 0.7166841  1

For the whole data set it would be:

id  year  value1     value2      value3 
1   2012 -1.38007779 2.97957329      2  
1   2013 -0.03257949 0.92996719      1   
2   2013  ""         ""              ""
2   2014  0.73453214 0.14497416      3   
6   2011 -1.04242748 0.03099297      0   
6   2012  0.35932001 0.90399050      3   
6   2013  1.18300395 0.29446006      1   
101 2014  ""        ""              ""
101 2015  0.25476203 0.71668415      1   

Upvotes: 1

Views: 5181

Answers (2)

David Arenburg
David Arenburg

Reputation: 92292

You could try something as the following

library(data.table); 
setDT(df)[, if(.N < 2L) {
               .SD[c(NA_real_, 1L)][, year := c(year[2L] - 1L, year[2L])]
          } else .SD,
          by = id]
#     id year     value1     value2 value3
# 1:   1 2012 -1.6552064 0.05040943      4
# 2:   1 2013  1.3478321 1.08314748      2
# 3:   2 2013         NA         NA     NA
# 4:   2 2014  0.2320925 0.45546268      0
# 5:   6 2011  0.5110803 0.07412734      2
# 6:   6 2012  0.5170348 0.29052181      0
# 7:   6 2013  0.6925863 0.09309213      3
# 8: 101 2014         NA         NA     NA
# 9: 101 2015  1.6719612 0.84413431      4

This is basically checks per group if it's larger than 1 (.N denote the size of the group), if so it returns it as is (.SD denotes the SubData which is the data within a specific group), if not, it replicates the whole row twice (subsets .SD by its first value and some nonexistent NA value) for that id and fills the missing year (I'm assigning it to column year using the := operator while subsetting from the non-NA year (year[2] - 1L) one year an adding it with that year itself(year[2]))

Upvotes: 2

akrun
akrun

Reputation: 887118

We could get the row index (.I) for 'id's that have only a single element (.N==1), subset the dataset with that index, subtract 1 from the 'year' column, assign (:=) the 'value' columns to NA, rbind with the original dataset, and then order the rows of the resulting dataset.

ind <- setDT(df)[,.I[.N==1], by = id]$V1
DT <- rbind(df,
         df[ind][, list(year=year-1) ,.(id)][,names(df)[3:ncol(df)] := NA])
DT[order(id, year)]
#     id year      value1     value2 value3
#1:   1 2012 -0.03373792 0.28790583      0
#2:   1 2013 -0.58542756 1.02799997      2
#3:   2 2013          NA         NA     NA
#4:   2 2014  0.61285136 0.13438167      1
#5:   6 2011  1.51712249 0.63349530      1
#6:   6 2012  0.65738044 2.86740657      1
#7:   6 2013 -1.07418134 0.43519879      1
#8: 101 2014          NA         NA     NA
#9: 101 2015 -4.46956441 0.05370307      4

NOTE: It would be better to have NA instead of '' as assigning to '' will also change the class of the columns.


The rbind approach can be also done with base R. We get the count of 'id' using table, subset the original dataset based on the table output, transform the 'year' column by subtracting 1, get the column index ('indx') of 'value' as column names (grep), multiply those columns with NA to change those columns to NA, rbind the two datasets and order the dataset based on 'id' and 'year' column.

tbl <- table(df$id)==1
df1 <- transform(subset(df, id %in% names(tbl)[tbl]), year=year-1)
indx <- grep('value', colnames(df1))  
df1[indx] <- df1[indx]*NA
df2 <- rbind(df, df1)
df2[order(df2$id, df2$year),]

Upvotes: 2

Related Questions