Add missing rows per missing Year by Group AND add more rows till present date year given a condition

Question

I've been searching arround SO, and other data science and programming blogs, but i haven't found an answer for my specific need. So, if you find this question duplicate please be kind enough and point me to source of the information and close/delete this question.

My real data will have several thousands of rows, so i display here only a small set of fictional data, that resembles very similarly my original data:

Data <- data.frame(CategoryA =c("Xpto1","Xpto1","Xpto1","Xpto1","Xpto2","Xpto2","Xpto2","Xpto1","Xpto1", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3"),CategoryB = c("Type1","Type1","Type1","Type1","Type1","Type1","Type1","Type2","Type2", "Type1", "Type1", "Type1", "Type1", "Type1", "Type1"),ID = c(1,1,1,1,2,2,2,3,3,4,4,4,4,4,4),Year =c(2014,2015,2016,2017,2007,2009,2010,2014,2016, 1997,2002,2010,2012,2013,2015),Class = c("New","Existing", "Existing", "Lost","New", "Existing", "Existing", "New", "Existing","New", "Lost", "Out","Recovered", "Existing", "Existing"))

I would post the data frame but i don't know how to do it properly in SO, althoug i'v read the proper thread about reproducible example: How to make a great R reproducible example?

My best attemp so far is as follows:

Data %>% 
group_by(CategoryA, CategoryB, ID) %>% 
complete(nesting(CategoryA, CategoryB, ID), Year = seq.int( min(Year), max(Year) ) ) %>%
arrange( ID, Year )

However, this aproach has 2 problems:

1 - it only adds rows to the missing years between min and max year of each group;

2 - when this aproach is used in the real data, since the dataset contains about 200K observations, the process is real slow (R takes about 2 or 3 minutes to complete the task at hand).

The explanation of the problem at hand, and rules of row assignation:

Given a list of item ID's, the Year, the Category A and the Category B variables, add rows in the data set in a way such as:

1 - given the max and min year for each group, insert row in missing year and assign the value "Existing" to the Class variable in that missing year (i'm assuming that the solution will automatcally fill Category A and Category B and ID values for each given group that has missing years);

2 - if in the max year of the given group the item is classified as "Existing" and the max year of that same group if inferior of todays date year, then add as many rows as necessary till todays date year and classify the item as "Existing".

Please note: i'm defining a group as the combination of variables: CategoryA, categoryB and ID

Please let me know if you found my rules explanation confusing, so i can have a chance to clarify them for you.

Thank you in advance, for any help you could provide.

EDIT: I updated the data set to better reflect the real Data.

Cheers! :)

Add missing rows per missing Year by Group AND add more rows till present date year given a condition

Answers (1)

Related Questions