Bolle
Bolle

Reputation: 322

R combining duplicate rows by ID with different column types in a dataframe

I have a dataframe with a column ID as an identifier and some other columns of different types (factors and numerics). It looks like this

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

In the end the result should look like this

dfRes <- data.frame(id    = c(1, 2, 3, 4),
                    abst  = c(1, 0, 0, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "rot")),
                    gier  = c(2.5, 0, 0, 3))

I hope there is a way to do the calculations fast, because I have about a million observations. Thanks in advance!

Edit (additions): "farbe" may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe", so there are 2 identical IDs and all the same but different values for "farbe". This should be just very rare case, but a great addition.

I have a lot more numeric and factor columns in my real data. Is it also possible to create a solution, so I don't have to define every single column?

Upvotes: 0

Views: 1780

Answers (4)

Andrew
Andrew

Reputation: 5138

EDIT:

Just saw your edit about non-unique factor columns and selecting columns by type. This will work but I will think about a cleaner way to do this and report back (I am sure there is a simple way). If you want to manually specify columns like the original example and you have non-unique factors, just use unlist() with unique() in the same fashion as below. Alternatively, you could consider combining all factor levels on one line using paste() with collapse = "; " or something to that effect. If you want to change the column order for the final data.table, use setcolorder() on the data.table

setDT(df)

# For selecting columns later
num_cols <- sapply(df, is.numeric)
num_cols[names(num_cols) == "id"] <- FALSE
fac_cols <- sapply(df, is.factor)

df[, lapply(.SD, mean, na.rm = T), by = id, .SDcols = num_cols][
  df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols], on = "id"]

   id abst gier farbe
1:  1    1  2.5 keine
2:  2    0  0.0 keine
3:  3    0  0.0 keine
4:  4    3  3.0  rot2
5:  4    3  3.0   rot

How it works: It joins the numeric column summary

df[, lapply(.SD, mean, na.rm = TRUE), by = id, .SDcols = num_cols]

with the factor column summary

df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols]

Data for edit:

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, "rot2", "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Original answer:

Here is one of many data.table solutions. This orders the data.table by the factor column so it can grab the top value while summarizing. I also converted it back to a pure data.frame but you do not have to do that if you do not want to. Hope this helps!

Also, this assumes that farbe will be the same for each id

library(data.table)

setDT(df)

df <- df[order(farbe), .(abst = mean(abst, na.rm = TRUE),
                         farbe = farbe[1],
                         gier = mean(gier, na.rm = TRUE)), by = id]

setDF(df)
df
  id abst farbe gier
1  1    1 keine  2.5
2  2    0 keine  0.0
3  3    0 keine  0.0
4  4    3   rot  3.0

Upvotes: 2

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

additional solution

df %>% 
  group_by(id) %>% 
  fill(farbe, .direction = "updown") %>% 
  group_by(id, farbe) %>% 
  summarise_all(~ mean(., na.rm = T))

Upvotes: 0

Simone Bianchi
Simone Bianchi

Reputation: 143

Also a data.table solution:

    library(data.table)
    df <- data.table( # instead of data.frame
                     id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4), 
                     abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                     farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                     gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))


    newdf <- df[,
                .(abst=mean(abst,na.rm=T), # perform mean of abst removing NAs
                  farbe=na.omit(unique(farbe)), # assuming farbe is unique for each ID, extract the unique value after removing the NAs
                  gier=mean(gier,na.rm=T)), # perform mean of gier removing NAs
                by=id] # for each ID

    newdf

       id abst farbe gier
    1:  1    1 keine  2.5
    2:  2    0 keine  0.0
    3:  3    0 keine  0.0
    4:  4    3   rot  3.0

Upvotes: 0

Allan Cameron
Allan Cameron

Reputation: 173813

A dplyr solution.

library(dplyr)

df %>% 
  group_by(id) %>% 
  summarise(abst = mean(na.omit(abst)), 
            farbe = na.omit(farbe)[1],
            gier = mean(na.omit(gier)))
#> # A tibble: 4 x 4
#>      id  abst farbe  gier
#>   <dbl> <dbl> <fct> <dbl>
#> 1     1     1 keine   2.5
#> 2     2     0 keine   0  
#> 3     3     0 keine   0  
#> 4     4     3 rot     3

Created on 2020-05-14 by the reprex package (v0.3.0)

Upvotes: 0

Related Questions