Reputation: 31
I have a large dataset that has duplicate times (rows) with data in both row columns that I would like to combine. The data looks like this:
date P1 PT1 P2 PT2 P3 PT3
5/5/2011@11:40 NA NA NA NA 9.4 10.1
5/5/2011@11:40 5.6 10.2 8.5 10.1 NA NA
I would like to get to this
date P1 PT1 P2 PT2 P3 PT3
5/5/2011@11:40 5.6 10.2 8.5 10.1 9.4 10.1
My dataset is 10 minutes data for ten years and the repeats are somewhat random. The @
sign was added to display properly.
I've tried rbind
and rbind.row.names
to no avail.
Thanks!
Upvotes: 3
Views: 748
Reputation: 102720
A data.table
option
> setDT(df)[, Map(na.omit, .SD), date]
date P1 PT1 P2 PT2 P3 PT3
1: 5/5/2011@11:40 5.6 10.2 8.5 10.1 9.4 10.1
Upvotes: 1
Reputation: 16998
You could use tidyr
's fill
function.
library(dplyr)
library(tidyr)
df %>%
group_by(date) %>%
fill(-date, .direction = "updown") %>%
distinct() %>%
ungroup()
to get
# A tibble: 1 x 7
date P1 PT1 P2 PT2 P3 PT3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5/5/2011 11:40 5.6 10.2 8.5 10.1 9.4 10.1
As mikebader pointed out, there might be an issue with duplicates created while filling the data. To handle this, we could use a custom function:
check_fill <- function(df) {
df <- df %>%
group_by(date) %>%
fill(-date, .direction = "updown") %>%
distinct()
if (df %>%
filter(n() > 1) %>%
nrow() > 0) {
warning("Duplicates detected.")
}
df %>%
ungroup()
}
Used on
df2 <- tribble(~date, ~P1, ~PT1, ~P2, ~PT2, ~P3, ~PT3,
"5/5/2011 11:40", NA, NA, NA, NA, 9.4, 10.1,
"5/5/2011 11:40", 5.6, 10.2, 8.5, 10.1, 9.5, NA
)
this returns
check_fill(df2)
#> Warning in check_fill(df2): Duplicates detected.
#> # A tibble: 2 x 7
#> date P1 PT1 P2 PT2 P3 PT3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5/5/2011 11:40 5.6 10.2 8.5 10.1 9.4 10.1
#> 2 5/5/2011 11:40 5.6 10.2 8.5 10.1 9.5 10.1
Upvotes: 2
Reputation: 1299
You can use the summarize()
function in dplyr. The following will work, but it does not check for duplicates, it only takes the maximum value for each date.
library(dplyr)
df <- tribble(~date, ~P1, ~PT1, ~P2, ~PT2, ~P3, ~PT3,
"5/5/2011@11:40", NA, NA, NA, NA, 9.4, 10.1,
"5/5/2011@11:40", 5.6, 10.2, 8.5, 10.1, NA, NA
)
df %>%
group_by(date) %>%
summarize(across(starts_with("P"), max, na.rm = TRUE))
In other words, if you are sure that your data include either a number or NA
, then it will work.
Upvotes: 3