zStrike
zStrike

Reputation: 57

Add missing rows per missing Year by Group AND add more rows till present date year given a condition

I've been searching arround SO, and other data science and programming blogs, but i haven't found an answer for my specific need. So, if you find this question duplicate please be kind enough and point me to source of the information and close/delete this question.

My real data will have several thousands of rows, so i display here only a small set of fictional data, that resembles very similarly my original data:

Data <- data.frame(CategoryA =c("Xpto1","Xpto1","Xpto1","Xpto1","Xpto2","Xpto2","Xpto2","Xpto1","Xpto1", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3", "Xpto3"),CategoryB = c("Type1","Type1","Type1","Type1","Type1","Type1","Type1","Type2","Type2", "Type1", "Type1", "Type1", "Type1", "Type1", "Type1"),ID = c(1,1,1,1,2,2,2,3,3,4,4,4,4,4,4),Year =c(2014,2015,2016,2017,2007,2009,2010,2014,2016, 1997,2002,2010,2012,2013,2015),Class = c("New","Existing", "Existing", "Lost","New", "Existing", "Existing", "New", "Existing","New", "Lost", "Out","Recovered", "Existing", "Existing"))

I would post the data frame but i don't know how to do it properly in SO, althoug i'v read the proper thread about reproducible example: How to make a great R reproducible example?

My best attemp so far is as follows:

Data %>% 
group_by(CategoryA, CategoryB, ID) %>% 
complete(nesting(CategoryA, CategoryB, ID), Year = seq.int( min(Year), max(Year) ) ) %>%
arrange( ID, Year )

However, this aproach has 2 problems:

1 - it only adds rows to the missing years between min and max year of each group;

2 - when this aproach is used in the real data, since the dataset contains about 200K observations, the process is real slow (R takes about 2 or 3 minutes to complete the task at hand).

The explanation of the problem at hand, and rules of row assignation:

Given a list of item ID's, the Year, the Category A and the Category B variables, add rows in the data set in a way such as:

1 - given the max and min year for each group, insert row in missing year and assign the value "Existing" to the Class variable in that missing year (i'm assuming that the solution will automatcally fill Category A and Category B and ID values for each given group that has missing years);

2 - if in the max year of the given group the item is classified as "Existing" and the max year of that same group if inferior of todays date year, then add as many rows as necessary till todays date year and classify the item as "Existing".

Please note: i'm defining a group as the combination of variables: CategoryA, categoryB and ID

Please let me know if you found my rules explanation confusing, so i can have a chance to clarify them for you.

Thank you in advance, for any help you could provide.

EDIT: I updated the data set to better reflect the real Data.

Cheers! :)

Upvotes: 0

Views: 59

Answers (1)

Jaap
Jaap

Reputation: 83245

A possible solution using :

library(data.table)
setDT(Data)[, .SD[CJ(Year = seq(min(Year), ifelse(Class[which.max(Year)] == "Existing",
                                                  year(Sys.Date()), max(Year))))
                  , on = .(Year)]
            , by = .(ID, CategoryA, CategoryB)
            ][is.na(Class), Class := "Existing"][]

which gives:

    ID CategoryA CategoryB Year    Class
 1:  1     Xpto1     Type1 2014      New
 2:  1     Xpto1     Type1 2015 Existing
 3:  1     Xpto1     Type1 2016 Existing
 4:  1     Xpto1     Type1 2017     Lost
 5:  2     Xpto2     Type1 2007      New
 6:  2     Xpto2     Type1 2008 Existing
 7:  2     Xpto2     Type1 2009 Existing
 8:  2     Xpto2     Type1 2010 Existing
 9:  2     Xpto2     Type1 2011 Existing
10:  2     Xpto2     Type1 2012 Existing
11:  2     Xpto2     Type1 2013 Existing
12:  2     Xpto2     Type1 2014 Existing
13:  2     Xpto2     Type1 2015 Existing
14:  2     Xpto2     Type1 2016 Existing
15:  2     Xpto2     Type1 2017 Existing
16:  2     Xpto2     Type1 2018 Existing
17:  3     Xpto1     Type2 2014      New
18:  3     Xpto1     Type2 2015 Existing
19:  3     Xpto1     Type2 2016 Existing
20:  3     Xpto1     Type2 2017 Existing
21:  3     Xpto1     Type2 2018 Existing

Upvotes: 3

Related Questions