Reputation: 11
I have source data like the following
Site_Id name Phone
150 dan 916-654-8585
150 sasha 916-654-8794
150 Sam 916-654-8589
161 Jeff 916-654-8255
161 Danny 916-654-8535
120 Ali 916-654-8575
and i wrote my R code to create a template based on frequency of Site_ID, as the highest frequency is 3 my template is like below
Site_ID name_1 Phone_1 name_2 Phone_2 name_3 Phone_3
I am looking for a way to import all my data into the template somehow
Result I'm looking for:
Site_ID name_1 Phone_1 name_2 Phone_2 name_3 Phone_3
150 dan 916-654-8585 sasha 916-654-8794 Sam 916-654-8589
161 Jeff 916-654-8255 Danny 916-654-8535 NA NA
120 Ali 916-654-8575 NA NA NA NA
Upvotes: 1
Views: 107
Reputation: 11965
library(dplyr)
library(splitstackshape)
df %>%
group_by(Site_Id) %>%
summarise_at(vars("name", "Phone"), funs(paste(., collapse=","))) %>%
cSplit(c("name", "Phone"))
Output is:
Site_Id name_1 name_2 name_3 Phone_1 Phone_2 Phone_3
1: 120 Ali NA NA 916-654-8575 NA NA
2: 150 dan sasha Sam 916-654-8585 916-654-8794 916-654-8589
3: 161 Jeff Danny NA 916-654-8255 916-654-8535 NA
Sample data:
df <- structure(list(Site_Id = c(150L, 150L, 150L, 161L, 161L, 120L
), name = c("dan", "sasha", "Sam", "Jeff", "Danny", "Ali"), Phone = c("916-654-8585",
"916-654-8794", "916-654-8589", "916-654-8255", "916-654-8535",
"916-654-8575")), .Names = c("Site_Id", "name", "Phone"), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 1
Reputation: 15072
There might be a better way to do this, not too sure, but here is one solution with the tidyverse
. The approach is to create site index columns for name
and phone
, which become new numbered column names using spread
. Finally, we can use summarise_at
to choose the first non-missing value from each column for each site. There should only be one based on how we set this up, producing the format desired. You can reorder the columns and rows however you like with select
and arrange
.
EDIT: changed to deal with many columns. This one basically wraps some functionality into spread_var
, which has a table and a variable as input and creates first an index column with that variable in title case, and then spreads the values of that variable across the index columns. We make sure to use this function on a table with lower case column names and grouped by site id. The sym()
, enquo
, !!
and :=
are needed to deal with dplyr
's complicated evaluation, but at least it works as written. Might be possible to make it simpler, I'm not super practiced with quosures.
Calling the function is fairly simple, we get the variables to loop through from colnames
and each loop spreads out another variable. Last line is same as before, collapse to the first non missing value. This has the advantage that it can deal with any number of variables per site_id
and any number of variable columns.
EDIT2: Updated with an alternative tidyverse
approach inspired by Prem, but which doesn't deal well with there being more than 3 values per site_id
. However, it does all fit in one pipe, which is nice.
library(tidyverse)
tbl <- read_table2(
"Site_Id name Phone
150 dan 916-654-8585
150 sasha 916-654-8794
150 Sam 916-654-8589
161 Jeff 916-654-8255
161 Danny 916-654-8535
120 Ali 916-654-8575"
)
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 2)
#> Warning: 1 parsing failure.
#> row # A tibble: 1 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 6 Phone "" embedded null literal data file # A tibble: 1 x 5
spread_var <- function(df, var){
varname <- rlang::sym(str_to_title(var))
expr <- enquo(var)
df %>%
mutate(!!varname := row_number()) %>%
spread(!!varname, !!expr, sep = "_")
}
out <- tbl %>%
magrittr::set_colnames(str_to_lower(colnames(.))) %>%
group_by(site_id)
for (i in colnames(out)[2:ncol(tbl)]){
out <- spread_var(out, i)
}
out %>%
summarise_at(vars(matches("_\\d$")), function(x) x[which(!is.na(x))[1]])
#> # A tibble: 3 x 7
#> site_id Name_1 Name_2 Name_3 Phone_1 Phone_2 Phone_3
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 120 Ali <NA> <NA> 916-654-8575 <NA> <NA>
#> 2 150 dan sasha Sam 916-654-8585 916-654-8589 916-654-8794
#> 3 161 Jeff Danny <NA> 916-654-8255 916-654-8535 <NA>
tbl %>%
magrittr::set_colnames(str_to_lower(colnames(.))) %>%
group_by(site_id) %>%
summarise_all(~ str_c(., collapse = ",")) %>%
imap_dfc(
~separate(
data = tibble(.x),
col = 1,
into = c(str_c(.y, "_1"), str_c(.y, "_2"), str_c(.y, "_3")),
sep = ",",
fill = "right"
)
) %>%
select(-site_id_2, -site_id_3) %>%
rename(site_id = site_id_1)
#> # A tibble: 3 x 7
#> site_id Name_1 Name_2 Name_3 Phone_1 Phone_2 Phone_3
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 120 Ali <NA> <NA> 916-654-8575 <NA> <NA>
#> 2 150 dan sasha Sam 916-654-8585 916-654-8589 916-654-8794
#> 3 161 Jeff Danny <NA> 916-654-8255 916-654-8535 <NA>
Created on 2018-03-07 by the reprex package (v0.2.0).
Upvotes: 1