gibbz00
gibbz00

Reputation: 1987

Inserting rows into a dataframe based on a vector that contains dates

This is what my dataframe looks like:

df <- read.table(text='

    Name      ActivityType     ActivityDate              
     John       Email            2014-01-01                              
     John       Webinar          2014-01-05                            
     John       Webinar          2014-01-20                                                       
     John       Email            2014-04-20                            
     Tom        Email            2014-01-01                              
     Tom       Webinar           2014-01-05                           
     Tom       Webinar           2014-01-20                                                        
     Tom       Email             2014-04-20                              

    ', header=T, row.names = NULL)

I have this vector x which contains different dates x<- c("2014-01-03","2014-01-25","2015-05-27"). I want to insert rows in my original dataframe in a way that incorporates these dates in the x vector.This is what the output should look like:

    Name      ActivityType     ActivityDate              
     John       Email            2014-01-01
     John        NA              2014-01-03        
     John       Webinar          2014-01-05                            
     John       Webinar          2014-01-20
     John       NA               2014-01-25                                                       
     John       Email            2014-04-20
     John       NA               2015-05-27                            
     Tom        Email            2014-01-01
     Tom        NA               2014-01-03                              
     Tom       Webinar           2014-01-05                           
     Tom       Webinar           2014-01-20
     Tom       NA                2014-01-25                                                        
     Tom       Email             2014-04-20
     Tom       NA                2015-05-27  

Sincerely appreciate your help!

Upvotes: 0

Views: 81

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269654

1) expand.grid Using expand.grid create a data frame adds with the rows to be added and then use rbind to combine df and adds converting the ActivityDate column to "Date" class. Then sort. No packages are used.

adds <- expand.grid(Name = levels(df$Name), ActivityType = NA, ActivityDate = x)
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))

o <- with(both, order(Name, ActivityDate))
both[o, ]

giving:

   Name ActivityType ActivityDate
1  John        Email   2014-01-01
9  John         <NA>   2014-01-03
2  John      Webinar   2014-01-05
3  John      Webinar   2014-01-20
11 John         <NA>   2014-01-25
4  John        Email   2014-04-20
13 John         <NA>   2015-05-27
5   Tom        Email   2014-01-01
10  Tom         <NA>   2014-01-03
6   Tom      Webinar   2014-01-05
7   Tom      Webinar   2014-01-20
12  Tom         <NA>   2014-01-25
8   Tom        Email   2014-04-20
14  Tom         <NA>   2015-05-27

2) sqldf This uploads adds and df to an sqlite data base which it creates on the fly, then it performs the sql query and downloads the result. The computation occurs outside of R so it might work with your large data.

adds <- data.frame(Name = NA, ActivityDate = x)

library(sqldf)

sqldf("select * 
       from (select * 
             from df 
             union 
             select a.Name, NULL ActivityType, ActivityDate 
             from (select distinct Name from df) a 
             cross join adds b
            ) order by 1, 3"
      )

giving:

   Name ActivityType ActivityDate
1  John        Email   2014-01-01
2  John         <NA>   2014-01-03
3  John      Webinar   2014-01-05
4  John      Webinar   2014-01-20
5  John         <NA>   2014-01-25
6  John        Email   2014-04-20
7  John         <NA>   2015-05-27
8   Tom        Email   2014-01-01
9   Tom         <NA>   2014-01-03
10  Tom      Webinar   2014-01-05
11  Tom      Webinar   2014-01-20
12  Tom         <NA>   2014-01-25
13  Tom        Email   2014-04-20
14  Tom         <NA>   2015-05-27

Upvotes: 3

SymbolixAU
SymbolixAU

Reputation: 26258

It looks like you've added one of the 'new' dates aginst each of the people, correct?

In which case you can turn your x into a data.frame, and merge/join it on

## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
                 ActivityType = c("Email","Web","Web","Email","Email","Web","Web", "Email"),
                 ActivityDate = c("2014-01-01","2014-05-01","2014-20-01","2014-20-04","2014-01-01","2014-05-01","2014-20-01","2014-20-04"))

## Turning x into a dataframe.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
                Name = rep(c("John","Tom"), 3))

merge(df, x, by=c("Name", "ActivityDate"), all=T)

#    Name ActivityDate ActivityType
# 1  John   2014-01-01        Email
# 2  John   2014-05-01          Web
# 3  John   2014-20-01          Web
# 4  John   2014-20-04        Email
# 5  John   2014-01-03         <NA>
# 6  John   2014-01-25         <NA>
# 7  John   2015-05-27         <NA>
# 8   Tom   2014-01-01        Email
# 9   Tom   2014-05-01          Web
# 10  Tom   2014-20-01          Web
# 11  Tom   2014-20-04        Email
# 12  Tom   2014-01-03         <NA>
# 13  Tom   2014-01-25         <NA>
# 14  Tom   2015-05-27         <NA>

Update

As you are having memory issues, you can use data.table thusly

library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)

merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)

or, if you're not looking to merge you can rbind them, using data.table's rbindlist

rbindlist(list(dt, x_dt), fill=TRUE)  ## fill sets the 'ActivityType' to NA in X

Update 2

To generate your x with 16000 uniqe names (I've used numbers here, but the principle is the same) and 30 dates

ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)

x <- data.frame(Names = rep(Names, length(ActivityDates)),
                           ActivityDates = rep(ActivityDates, length(Names)))

Upvotes: 4

Related Questions