Reputation: 57
I've been searching arround SO, and other data science and programming blogs, but i haven't found an answer for my specific need. So, if you find this question duplicate please be kind enough and point me to source of the information and close/delete this question.
My real data will have several thousands of rows, so i display here only a small set of fictional data, that resembles very similarly my original data:
Data <- data.frame(CategoryA =c("Xpto1","Xpto1","Xpto1","Xpto1","Xpto2","Xpto2","Xpto2","Xpto1","Xpto1", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3"),CategoryB = c("Type1","Type1","Type1","Type1","Type1","Type1","Type1","Type2","Type2", "Type1", "Type1", "Type1", "Type1", "Type1", "Type1"),ID = c(1,1,1,1,2,2,2,3,3,4,4,4,4,4,4),Year =c(2014,2015,2016,2017,2007,2009,2010,2014,2016, 1997,2002,2010,2012,2013,2015),Class = c("New","Existing", "Existing", "Lost","New", "Existing", "Existing", "New", "Existing","New", "Lost", "Out","Recovered", "Existing", "Existing"))
I would post the data frame but i don't know how to do it properly in SO, althoug i'v read the proper thread about reproducible example: How to make a great R reproducible example?
My best attemp so far is as follows:
Data %>%
group_by(CategoryA, CategoryB, ID) %>%
complete(nesting(CategoryA, CategoryB, ID), Year = seq.int( min(Year), max(Year) ) ) %>%
arrange( ID, Year )
However, this aproach has 2 problems:
1 - it only adds rows to the missing years between min and max year of each group;
2 - when this aproach is used in the real data, since the dataset contains about 200K observations, the process is real slow (R takes about 2 or 3 minutes to complete the task at hand).
The explanation of the problem at hand, and rules of row assignation:
Given a list of item ID's, the Year, the Category A and the Category B variables, add rows in the data set in a way such as:
1 - given the max and min year for each group, insert row in missing year and assign the value "Existing" to the Class variable in that missing year (i'm assuming that the solution will automatcally fill Category A and Category B and ID values for each given group that has missing years);
2 - if in the max year of the given group the item is classified as "Existing" and the max year of that same group if inferior of todays date year, then add as many rows as necessary till todays date year and classify the item as "Existing".
Please note: i'm defining a group as the combination of variables: CategoryA, categoryB and ID
Please let me know if you found my rules explanation confusing, so i can have a chance to clarify them for you.
Thank you in advance, for any help you could provide.
EDIT: I updated the data set to better reflect the real Data.
Cheers! :)
Upvotes: 0
Views: 59
Reputation: 83245
A possible solution using data.table:
library(data.table)
setDT(Data)[, .SD[CJ(Year = seq(min(Year), ifelse(Class[which.max(Year)] == "Existing",
year(Sys.Date()), max(Year))))
, on = .(Year)]
, by = .(ID, CategoryA, CategoryB)
][is.na(Class), Class := "Existing"][]
which gives:
ID CategoryA CategoryB Year Class 1: 1 Xpto1 Type1 2014 New 2: 1 Xpto1 Type1 2015 Existing 3: 1 Xpto1 Type1 2016 Existing 4: 1 Xpto1 Type1 2017 Lost 5: 2 Xpto2 Type1 2007 New 6: 2 Xpto2 Type1 2008 Existing 7: 2 Xpto2 Type1 2009 Existing 8: 2 Xpto2 Type1 2010 Existing 9: 2 Xpto2 Type1 2011 Existing 10: 2 Xpto2 Type1 2012 Existing 11: 2 Xpto2 Type1 2013 Existing 12: 2 Xpto2 Type1 2014 Existing 13: 2 Xpto2 Type1 2015 Existing 14: 2 Xpto2 Type1 2016 Existing 15: 2 Xpto2 Type1 2017 Existing 16: 2 Xpto2 Type1 2018 Existing 17: 3 Xpto1 Type2 2014 New 18: 3 Xpto1 Type2 2015 Existing 19: 3 Xpto1 Type2 2016 Existing 20: 3 Xpto1 Type2 2017 Existing 21: 3 Xpto1 Type2 2018 Existing
Upvotes: 3