Shawn Brar
Shawn Brar

Reputation: 1420

Fill the actual data with duplicated data and remove duplicated data

I have the following data:-

library(data.table)
data <- data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(NA, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, NA, 223, 33),
                   versaze = c(NA, 22, 124, 33, NA, 44, 43))
data

#    address revenue castord versaze
# 1:      AA      NA      21      NA
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      NA      NA
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

Now this data has AA and DD addresses duplicated. And their first occurrence has some NA data as you can see in the row 1 and 5. What I want to do is fill this data by using the duplicated rows of these addresses. If the duplicated row has NA values, then NA shouldn't replace the values in the row of first occurrence. This will give me the following data:-

    data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(22, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, 33, 223, 33),
                   versaze = c(124, 22, 124, 33, 43, 44, 43))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      33      43
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

And then remove those duplicated rows:-

data <- data.table(address = c("AA", "BB", "CC", "DD", "EE"),
                   revenue = c(22, 121, 33, 44, 33),
                   castord = c(21, 22, 3, 33, 223),
                   versaze = c(124, 22, 33, 43, 44))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44

Upvotes: 1

Views: 50

Answers (2)

zx8754
zx8754

Reputation: 56169

Using data.table, group by address, loop through columns, remove NAs, get the first value:

data[, lapply(.SD, function(i) (na.omit(i)[ 1 ])), by = address]
#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44

Upvotes: 3

John Nielsen
John Nielsen

Reputation: 194

If you use dplyr::group_by and summarise where you take the first value that is not NA using na.omit, it fills out the empty values for you if the first row is NA.

data <- data %>% group_by(address) %>% 
  summarise(
    revenue = first(na.omit(revenue)),
    castord = first(na.omit(castord)),
    versaze = first(na.omit(versaze))
    )

Upvotes: 1

Related Questions