Reputation: 51
I have an R data processing problem. I have a large set of data in a .csv file that I can load in using readr. The problem is that there are missing values that I need to add to the dataframe.
Example data:
X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150
This file will load in correctly. However, I want to add processing so that the final dataframe contains an entry for every X1 type and the entire sequence 1:5 in X2, with the missing values set equal to zero. The final dataframe would look like this:
X1 X2 Value
Coal 1 300
Coal 2 0
Coal 3 100
Coal 4 0
Coal 5 150
NatG 1 0
NatG 2 175
NatG 3 0
NatG 4 150
NatG 5 0
I tried using readr to load in a file that had all entries equal to zero and then read in the real data to the same data frame, but it was a destructive overwrite and the missing rows were removed. I need to know how to add appropriate rows to the dataframe without knowing which values in the 1:5 sequence will be missing for each value under X1.
Upvotes: 5
Views: 3023
Reputation: 39174
We can use the complete function from the tidyr package. complete creates every combination of the values in the columns provided, and the fill argument specifies the values to use for the entries that were missing.
library(tidyr)
dt2 <- dt %>%
  complete(X1, X2, fill = list(Value = 0))
dt2
# # A tibble: 10 x 3
# X1 X2 Value
# <chr> <int> <dbl>
# 1 Coal 1 300
# 2 Coal 2 0
# 3 Coal 3 100
# 4 Coal 4 0
# 5 Coal 5 150
# 6 NatG 1 0
# 7 NatG 2 175
# 8 NatG 3 0
# 9 NatG 4 150
# 10 NatG 5 0
We can also supply the values for a column explicitly, in case some of them never appear in the data. For example, the following makes sure that every value from 1 to 5 in column X2 is considered.
dt2 <- dt %>%
  complete(X1, X2 = 1:5, fill = list(Value = 0))
Or use full_seq, which will automatically create the sequence based on the minimum and maximum values in a column.
dt2 <- dt %>%
  complete(X1, X2 = full_seq(X2, period = 1), fill = list(Value = 0))
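One thing worth checking when choosing between the two variants: full_seq only spans the smallest to the largest value that actually occurs in the column, so it may stop short of 5 if no row has X2 = 5. A minimal sketch of the difference, assuming a hypothetical subset of the data (dt_short is made up for illustration and is not part of the question):

```r
library(tidyr)

# full_seq() fills the gaps between the minimum and maximum of its input
full_seq(c(1, 3, 5), period = 1)

# Hypothetical subset of the data in which no row has X2 = 5
dt_short <- data.frame(
  X1 = c("Coal", "Coal", "NatG", "NatG"),
  X2 = c(1, 3, 2, 4),
  Value = c(300, 100, 175, 150),
  stringsAsFactors = FALSE
)

# Sequence taken from the observed range of X2 (1 to 4 here)
by_range <- complete(dt_short, X1, X2 = full_seq(X2, period = 1),
                     fill = list(Value = 0))

# Sequence given explicitly (always 1 to 5)
by_fixed <- complete(dt_short, X1, X2 = 1:5, fill = list(Value = 0))

nrow(by_range)  # 2 types x 4 values of X2
nrow(by_fixed)  # 2 types x 5 values of X2
```

If the required range is known in advance, as with 1:5 in the question, passing it explicitly is probably the safer choice.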
DATA
dt <- read.table(text = "X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 7
Reputation: 7312
Great solution from @www with tidyr. This is a way to do it in base R.
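levels <- 1:5
type <- c("Coal", "NatG")
# All combinations of level and type (expand.grid names the columns Var1 and Var2)
df2 <- expand.grid(levels, type)
# Keep every row of df2; combinations missing from dt get NA in Value
df3 <- merge(dt, df2, by.x = c("X1", "X2"), by.y = c("Var2", "Var1"), all.y = TRUE)
df3[is.na(df3)] <- 0
df3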
X1 X2 Value
1 Coal 1 300
2 Coal 2 0
3 Coal 3 100
4 Coal 4 0
5 Coal 5 150
6 NatG 1 0
7 NatG 2 175
8 NatG 3 0
9 NatG 4 150
10 NatG 5 0
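If the X1 types are not known ahead of time, the same idea should work with the types taken from the data instead of typed in by hand. A minimal sketch, assuming the 1:5 range from the question; the names grid and filled are made up for illustration:

```r
# Data from the question
dt <- read.table(text = "X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150",
header = TRUE, stringsAsFactors = FALSE)

# Every combination of the types present in the data and the full 1:5 sequence
grid <- expand.grid(X1 = unique(dt$X1), X2 = 1:5, stringsAsFactors = FALSE)

# Keep every row of the grid; combinations absent from dt get NA in Value
filled <- merge(dt, grid, by = c("X1", "X2"), all.y = TRUE)

# Replace the NAs introduced by the merge with zero
filled$Value[is.na(filled$Value)] <- 0
filled
```

Naming the grid columns X1 and X2 means a single by argument can be used, and only the Value column is touched when the NAs are replaced.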
Upvotes: 3