Robin Kohrs
Robin Kohrs

Reputation: 697

Summarise a column and thereby remove unwanted NAs in others

Once again I'm a little stuck and reaching out for help. I hope one day being able to give this help back...

Anyways, I have a tibble that looks like this:

# A tibble: 20 x 6
# Groups:   tipologia [6]
   tipologia                                   date_info pct_day pct_month pct_year pct_no_date
   <chr>                                       <chr>       <dbl>     <dbl>    <dbl>       <dbl>
 1 Aree soggette a crolli/ribaltamenti diffusi day        0.0508  NA         NA         NA     
 2 Aree soggette a crolli/ribaltamenti diffusi month     NA        0.0217    NA         NA     
 3 Aree soggette a crolli/ribaltamenti diffusi no date   NA       NA         NA          0.227 
 4 Aree soggette a crolli/ribaltamenti diffusi year      NA       NA          0.701     NA     
 5 Aree soggette a frane superficiali diffuse  day        0.0721  NA         NA         NA     
 6 Aree soggette a frane superficiali diffuse  month     NA        0.0218    NA         NA     
 7 Aree soggette a frane superficiali diffuse  no date   NA       NA         NA          0.570 
 8 Aree soggette a frane superficiali diffuse  year      NA       NA          0.336     NA     
 9 Aree soggette a sprofondamenti diffusi      day        0.143   NA         NA         NA     
10 Aree soggette a sprofondamenti diffusi      no date   NA       NA         NA          0.286 
11 Aree soggette a sprofondamenti diffusi      year      NA       NA          0.571     NA     
12 Colamento lento                             day        0.119   NA         NA         NA     
13 Colamento lento                             month     NA        0.0475    NA         NA     
14 Colamento lento                             no date   NA       NA         NA          0.122 
15 Colamento lento                             year      NA       NA          0.712     NA     
16 Colamento rapido                            day        0.478   NA         NA         NA     
17 Colamento rapido                            month     NA        0.00838   NA         NA     
18 Colamento rapido                            no date   NA       NA         NA          0.0642
19 Colamento rapido                            year      NA       NA          0.450     NA     
20 Complesso                                   day        0.262   NA         NA         NA     

There are four entries in "tipologia" as there are four possible date-informations (day, year, month or no info at all). What I'd like to have is only one row for each tipologia and essentially removing this unnecessary NAs. The NAs can't have any values, so they are just a little annoying.

I tried a lot with grouping again and summarizing, but did not achieve what I wanted to do. So any idea would be super helpful:)

Upvotes: 0

Views: 65

Answers (2)

jay.sf
jay.sf

Reputation: 73342

You could use aggregate and loop over the columns with lapply, then merge.

Reduce(function(...) merge(..., all=T), lapply(names(dat)[3:6], function(x) 
  aggregate(as.formula(paste(x, "~ tipologia")), dat, I)))
#                                     tipologia pct_day pct_month pct_year pct_no_date
# 1 Aree soggette a crolli/ribaltamenti diffusi  0.0508   0.02170    0.701      0.2270
# 2  Aree soggette a frane superficiali diffuse  0.0721   0.02180    0.336      0.5700
# 3      Aree soggette a sprofondamenti diffusi  0.1430        NA    0.571      0.2860
# 4                             Colamento lento  0.1190   0.04750    0.712      0.1220
# 5                            Colamento rapido  0.4780   0.00838    0.450      0.0642
# 6                                   Complesso  0.2620        NA       NA          NA

Data:

dat <- structure(list(tipologia = c("Aree soggette a crolli/ribaltamenti diffusi", 
"Aree soggette a crolli/ribaltamenti diffusi", "Aree soggette a crolli/ribaltamenti diffusi", 
"Aree soggette a crolli/ribaltamenti diffusi", "Aree soggette a frane superficiali diffuse", 
"Aree soggette a frane superficiali diffuse", "Aree soggette a frane superficiali diffuse", 
"Aree soggette a frane superficiali diffuse", "Aree soggette a sprofondamenti diffusi", 
"Aree soggette a sprofondamenti diffusi", "Aree soggette a sprofondamenti diffusi", 
"Colamento lento", "Colamento lento", "Colamento lento", "Colamento lento", 
"Colamento rapido", "Colamento rapido", "Colamento rapido", "Colamento rapido", 
"Complesso"), date_info = c("day", "month", "no date", "year", 
"day", "month", "no date", "year", "day", "no date", "year", 
"day", "month", "no date", "year", "day", "month", "no date", 
"year", "day"), pct_day = c(0.0508, NA, NA, NA, 0.0721, NA, NA, 
NA, 0.143, NA, NA, 0.119, NA, NA, NA, 0.478, NA, NA, NA, 0.262
), pct_month = c(NA, 0.0217, NA, NA, NA, 0.0218, NA, NA, NA, 
NA, NA, NA, 0.0475, NA, NA, NA, 0.00838, NA, NA, NA), pct_year = c(NA, 
NA, NA, 0.701, NA, NA, NA, 0.336, NA, NA, 0.571, NA, NA, NA, 
0.712, NA, NA, NA, 0.45, NA), pct_no_date = c(NA, NA, 0.227, 
NA, NA, NA, 0.57, NA, NA, 0.286, NA, NA, NA, 0.122, NA, NA, NA, 
0.0642, NA, NA)), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", 
"15", "16", "17", "18", "19", "20"))

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 389155

You can use na.omit to drop the NA values.

library(dplyr)
df %>%
  group_by(tipologia) %>%
  summarise(across(starts_with('pct'), na.omit))

na.omit should work for the above data but a little safer option would be :

df %>%
  group_by(tipologia) %>%
  summarise(across(starts_with('pct'), ~.x[!is.na(x)][1]))

Upvotes: 2

Related Questions