schultz45
schultz45

Reputation: 31

R function to combine rows based on duplicate times

I have a large dataset that has duplicate times (rows) with data in both row columns that I would like to combine. The data looks like this:

date              P1   PT1  P2   PT2   P3   PT3

5/5/2011@11:40    NA   NA   NA   NA   9.4   10.1

5/5/2011@11:40    5.6  10.2  8.5 10.1  NA   NA

I would like to get to this

date                P1     PT1     P2    PT2    P3    PT3

5/5/2011@11:40    5.6  10.2  8.5 10.1  9.4   10.1

My dataset is 10 minutes data for ten years and the repeats are somewhat random. The @ sign was added to display properly.

I've tried rbind and rbind.row.names to no avail.

Thanks!

Upvotes: 3

Views: 748

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 102720

A data.table option

> setDT(df)[, Map(na.omit, .SD), date]
             date  P1  PT1  P2  PT2  P3  PT3
1: 5/5/2011@11:40 5.6 10.2 8.5 10.1 9.4 10.1

Upvotes: 1

Martin Gal
Martin Gal

Reputation: 16998

You could use tidyr's fill function.

library(dplyr)
library(tidyr)

df %>% 
    group_by(date) %>% 
    fill(-date, .direction = "updown") %>% 
    distinct() %>%
    ungroup()

to get

# A tibble: 1 x 7
  date              P1   PT1    P2   PT2    P3   PT3
  <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5/5/2011 11:40   5.6  10.2   8.5  10.1   9.4  10.1

As mikebader pointed out, there might be an issue with duplicates created while filling the data. To handle this, we could use a custom function:

check_fill <- function(df) {
  
  df <- df %>% 
    group_by(date) %>% 
    fill(-date, .direction = "updown") %>% 
    distinct() 
  
  if (df %>% 
      filter(n() > 1) %>% 
      nrow() > 0) {
    warning("Duplicates detected.")
  }
  
  df %>% 
    ungroup()
}

Used on

df2 <- tribble(~date, ~P1, ~PT1, ~P2, ~PT2, ~P3, ~PT3, 
              "5/5/2011 11:40", NA, NA, NA, NA, 9.4, 10.1, 
              "5/5/2011 11:40", 5.6, 10.2, 8.5, 10.1, 9.5, NA
)

this returns

check_fill(df2)
#> Warning in check_fill(df2): Duplicates detected.
#> # A tibble: 2 x 7
#>   date              P1   PT1    P2   PT2    P3   PT3
#>   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5/5/2011 11:40   5.6  10.2   8.5  10.1   9.4  10.1
#> 2 5/5/2011 11:40   5.6  10.2   8.5  10.1   9.5  10.1

Upvotes: 2

mikebader
mikebader

Reputation: 1299

You can use the summarize() function in dplyr. The following will work, but it does not check for duplicates, it only takes the maximum value for each date.

library(dplyr)
df <- tribble(~date, ~P1, ~PT1, ~P2, ~PT2, ~P3, ~PT3, 
        "5/5/2011@11:40", NA, NA, NA, NA, 9.4, 10.1, 
        "5/5/2011@11:40", 5.6, 10.2, 8.5, 10.1, NA, NA
)

df %>%
    group_by(date) %>%
    summarize(across(starts_with("P"), max, na.rm = TRUE))

In other words, if you are sure that your data include either a number or NA, then it will work.

Upvotes: 3

Related Questions