Reputation: 587
I have a dataset with some duplicate entries that I want to reduce to only the unique combinations of values, with a dup_num column giving the number of duplicate entries and a dup_rows column indicating which rows contain the duplicated data.
I implemented a solution based on Finding duplicate observations of selected variables in a tibble, but it throws a mess of warnings when coercing the column containing the list of row numbers to a character vector. That is not a problem right now, but I want to show this data with DT and Shiny, and the warnings are a problem for that application.
library(tidyverse)
df <- tibble(episode = 1:30,
day = rep(c("Mon", "Wed", "Fri"), 10),
name = rep(c(
"Moe", "Larry", "Curly", "Shemp", "extra"
), 6))
chr_dups <- as_mapper( ~ str_c(.x) %>%
str_remove_all("[c\\(\\)]"))
df %>%
nest(episode, .key = "dups") %>%
mutate(dup_num = map_dbl(dups, nrow),
dup_rows = map_chr(dups, chr_dups))
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing
#> # A tibble: 15 x 5
#> day name dups dup_num dup_rows
#> <chr> <chr> <list> <dbl> <chr>
#> 1 Mon Moe <tibble [2 x 1]> 2 1, 16
#> 2 Wed Larry <tibble [2 x 1]> 2 2, 17
#> 3 Fri Curly <tibble [2 x 1]> 2 3, 18
#> 4 Mon Shemp <tibble [2 x 1]> 2 4, 19
#> 5 Wed extra <tibble [2 x 1]> 2 5, 20
#> 6 Fri Moe <tibble [2 x 1]> 2 6, 21
#> 7 Mon Larry <tibble [2 x 1]> 2 7, 22
#> 8 Wed Curly <tibble [2 x 1]> 2 8, 23
#> 9 Fri Shemp <tibble [2 x 1]> 2 9, 24
#> 10 Mon extra <tibble [2 x 1]> 2 10, 25
#> 11 Wed Moe <tibble [2 x 1]> 2 11, 26
#> 12 Fri Larry <tibble [2 x 1]> 2 12, 27
#> 13 Mon Curly <tibble [2 x 1]> 2 13, 28
#> 14 Wed Shemp <tibble [2 x 1]> 2 14, 29
#> 15 Fri extra <tibble [2 x 1]> 2 15, 30
Created on 2019-09-19 by the reprex package (v0.3.0)
I am pretty sure that the problem is in as_mapper().
The reprex above uses representative toy data. The tibble describes some episodes from the Three Stooges, the day each episode ran, and the character who was the protagonist of the episode.
Thanks!
Upvotes: 1
Views: 372
Reputation: 887108
The warning arises because the list elements are not atomic: dups is a list of tibbles, which we can see if we pull the column
df %>%
nest(dups = episode) %>%
pull(dups)
#<list_of<tbl_df<episode:integer>>[15]>
#[[1]]
# A tibble: 2 x 1
# episode
# <int>
#1 1
#2 16
#[[2]]
# A tibble: 2 x 1
# episode
# <int>
#1 2
#2 17
# ...
So it is a list of tibbles. We can either extract the column with pull, or flatten it, and then apply the function
library(purrr)
df %>%
nest(dups = episode) %>%
mutate(dup_num = map_dbl(dups, nrow),
dup_rows = map(dups, ~ flatten_int(.x) %>%
chr_dups))
NOTE: It is not clear why the function 'chr_dups' is applied to the 'episode' column, which is numeric, and the transformations it performs do not seem to make sense.
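A minimal sketch of the "extract the column" idea, assuming tidyr 1.0.0 or later for the nest(dups = episode) syntax. Handing str_c() the atomic episode vector from each nested tibble, rather than the tibble itself, should avoid the coercion warning, and collapse = ", " stands in for the regex clean-up done by chr_dups. The name res is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

# Same toy data as in the question
df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

res <- df %>%
  nest(dups = episode) %>%                      # one nested tibble per day/name pair
  mutate(dup_num  = map_int(dups, nrow),        # number of rows in each nested tibble
         # .x$episode is an atomic integer vector, so str_c() has nothing to coerce
         dup_rows = map_chr(dups, ~ str_c(.x$episode, collapse = ", ")))
```

Because each call returns a single string, map_chr() gives a plain character column rather than a list column.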
If we just need to paste the elements of 'episode' grouped by the other columns, a one-line base R approach is
aggregate(episode~ day + name, df, toString)
# day name episode
#1 Fri Curly 3, 18
#2 Mon Curly 13, 28
#3 Wed Curly 8, 23
#4 Fri extra 15, 30
#5 Mon extra 10, 25
#6 Wed extra 5, 20
#7 Fri Larry 12, 27
#8 Mon Larry 7, 22
#9 Wed Larry 2, 17
#10 Fri Moe 6, 21
#11 Mon Moe 1, 16
#12 Wed Moe 11, 26
#13 Fri Shemp 9, 24
#14 Mon Shemp 4, 19
#15 Wed Shemp 14, 29
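The question also asks for a count column. One possible base R sketch, not part of the answer above, runs aggregate() a second time with length and merges the two results; the column names dup_num and dup_rows follow the question, while rows, counts and out are illustrative names.

```r
# Same toy data as in the question, built as a plain data frame
df <- data.frame(episode = 1:30,
                 day = rep(c("Mon", "Wed", "Fri"), 10),
                 name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

# Episode numbers pasted together per day/name combination
rows <- aggregate(episode ~ day + name, df, toString)
names(rows)[3] <- "dup_rows"

# Number of entries per day/name combination
counts <- aggregate(episode ~ day + name, df, length)
names(counts)[3] <- "dup_num"

# Join the two summaries on the grouping columns
out <- merge(counts, rows, by = c("day", "name"))
```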
Upvotes: 3
Reputation: 15072
I think the source of the warning has already been addressed. I'll add that you can do this without mapping, using just vectorised functions.
library(tidyverse)
df <- tibble(episode = 1:30,
day = rep(c("Mon", "Wed", "Fri"), 10),
name = rep(c(
"Moe", "Larry", "Curly", "Shemp", "extra"
), 6))
df %>%
group_by(day, name) %>%
summarise(
dup_num = n(),
dup_rows = str_c(episode, collapse = ", ")
)
#> # A tibble: 15 x 4
#> # Groups: day [3]
#> day name dup_num dup_rows
#> <chr> <chr> <int> <chr>
#> 1 Fri Curly 2 3, 18
#> 2 Fri extra 2 15, 30
#> 3 Fri Larry 2 12, 27
#> 4 Fri Moe 2 6, 21
#> 5 Fri Shemp 2 9, 24
#> 6 Mon Curly 2 13, 28
#> 7 Mon extra 2 10, 25
#> 8 Mon Larry 2 7, 22
#> 9 Mon Moe 2 1, 16
#> 10 Mon Shemp 2 4, 19
#> 11 Wed Curly 2 8, 23
#> 12 Wed extra 2 5, 20
#> 13 Wed Larry 2 2, 17
#> 14 Wed Moe 2 11, 26
#> 15 Wed Shemp 2 14, 29
Created on 2019-09-19 by the reprex package (v0.3.0)
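The "# Groups: day [3]" header in the output shows that the result is still grouped by day. If a plain tibble is preferred before handing the data to DT, a hedged variation is to add ungroup() at the end; paste() is used here only to keep the sketch to dplyr, and res is an illustrative name.

```r
library(dplyr)

# Same toy data as in the question
df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

res <- df %>%
  group_by(day, name) %>%
  summarise(dup_num  = n(),
            dup_rows = paste(episode, collapse = ", ")) %>%
  ungroup()   # drop the remaining grouping by day
```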
Upvotes: 2
Reputation: 4233
Just adding to the other answers: you don't have to use purrr to achieve what you want. Base R's sapply will do for the mapping step.
df <- df %>%
nest(episode, .key = "dups") %>%
mutate(dup_num = sapply(dups, nrow),
dup_rows = sapply(dups, function(x) paste0(x$episode, collapse = ",")))
Upvotes: 1