Erazijus
Erazijus

Reputation: 71

Check if every group has same values in one of the columns

I don't know is question is described properly, but I'd like to check if every group has same values in column file_month and if no, then insert missing value by using next available row. So for example my data looks like this:

  file_month CountryName ProductName target_fe1 target_fe2 
       <int> <chr>       <chr>            <dbl>      <dbl>   
1          7 Germany     Inflation         7.52       4.84     
2          9 Germany     Inflation         7.92       4.74     
3          7 USA         Inflation         8.02       3.72     
4          8 USA         Inflation         8.02       3.72     
5          9 USA         Inflation         8.08       3.92     

It is grouped by CountryName and ProductName. Rows, where CountryName is "Germany", are missing file_month = 8, it only has file month = c(7,9), while every country should have file month = c(7,8,9).

I would like to fill this missing row in the way that it uses values from next row:

  file_month CountryName ProductName target_fe1 target_fe2 
       <int> <chr>       <chr>            <dbl>      <dbl>   
1          7 Germany     Inflation         7.52       4.84     
3          **8 Germany     Inflation         7.92       4.74**
4          9 Germany     Inflation         7.92       4.74     
5          7 USA         Inflation         8.02       3.72     
6          8 USA         Inflation         8.02       3.72     
7          9 USA         Inflation         8.08       3.92     

for reproducability:

df <- structure(list(file_month = c(7L, 9L, 7L, 8L, 9L), 
                     CountryName = c("Germany", "Germany", "USA", "USA", "USA"), 
                     ProductName = c("Inflation", "Inflation", "Inflation", "Inflation", "Inflation"), 
                     target_fe1 = c(7.52460071395529,7.92460071395529, 8.0153618755289, 8.0153618755289, 8.08264211789884), 
                     target_fe2 = c(4.8379739368994, 4.7379739368994, 3.71980735749082,  3.71980735749082, 3.92265497264457)),
                row.names = c(NA, -5L), class = "data.frame")

How can I do it, preferably using data.table?

Upvotes: 0

Views: 38

Answers (1)

Dimitrios Zacharatos
Dimitrios Zacharatos

Reputation: 850

Unfortunately I do now know data table. You can do something similar with base R and tidyr

df <- structure(list(file_month = c(7L, 9L, 7L, 8L, 9L), 
                     CountryName = c("Germany", "Germany", "USA", "USA", "USA"), 
                     ProductName = c("Inflation", "Inflation", "Inflation", "Inflation", "Inflation"), 
                     target_fe1 = c(7.52460071395529,7.92460071395529, 8.0153618755289, 8.0153618755289, 8.08264211789884), 
                     target_fe2 = c(4.8379739368994, 4.7379739368994, 3.71980735749082,  3.71980735749082, 3.92265497264457)),
                row.names = c(NA, -5L), class = "data.frame")

missing<-data.frame(table(df$file_month,df$CountryName))
missing<-missing[missing$Freq==0,]
new_rows<-data.frame(file_month=missing$Var1,CountryName=missing$Var2,ProductName="Inflation",target_fe1=NA,target_fe2=NA)
df<-rbind(df,new_rows)
df<-df[order(df$CountryName,df$file_month),]
tidyr::fill(df,target_fe1,target_fe2,.direction=c("updown"))

The output is

> tidyr::fill(df,target_fe1,target_fe2,.direction=c("updown"))
  file_month CountryName ProductName target_fe1 target_fe2
1          7     Germany   Inflation   7.524601   4.837974
6          8     Germany   Inflation   7.924601   4.737974
2          9     Germany   Inflation   7.924601   4.737974
3          7         USA   Inflation   8.015362   3.719807
4          8         USA   Inflation   8.015362   3.719807
5          9         USA   Inflation   8.082642   3.922655

Upvotes: 0

Related Questions