Saman Makhzani
Saman Makhzani

Reputation: 11

Data merging and cleaning in R

I have source data like the following

Site_Id    name    Phone
150        dan     916-654-8585
150        sasha   916-654-8794
150        Sam     916-654-8589
161        Jeff    916-654-8255
161        Danny   916-654-8535
120        Ali     916-654-8575

and i wrote my R code to create a template based on frequency of Site_ID, as the highest frequency is 3 my template is like below

Site_ID  name_1  Phone_1  name_2  Phone_2  name_3  Phone_3

I am looking for a way to import all my data into the template somehow

Result I'm looking for:

Site_ID  name_1  Phone_1       name_2  Phone_2        name_3  Phone_3
150     dan     916-654-8585   sasha  916-654-8794    Sam     916-654-8589
161     Jeff    916-654-8255   Danny  916-654-8535    NA      NA
120     Ali     916-654-8575   NA     NA              NA      NA

Upvotes: 1

Views: 107

Answers (2)

Prem
Prem

Reputation: 11965

library(dplyr)
library(splitstackshape)

df %>%
  group_by(Site_Id) %>%
  summarise_at(vars("name", "Phone"), funs(paste(., collapse=","))) %>%
  cSplit(c("name", "Phone"))

Output is:

   Site_Id name_1 name_2 name_3      Phone_1      Phone_2      Phone_3
1:     120    Ali     NA     NA 916-654-8575           NA           NA
2:     150    dan  sasha    Sam 916-654-8585 916-654-8794 916-654-8589
3:     161   Jeff  Danny     NA 916-654-8255 916-654-8535           NA

Sample data:

df <- structure(list(Site_Id = c(150L, 150L, 150L, 161L, 161L, 120L
), name = c("dan", "sasha", "Sam", "Jeff", "Danny", "Ali"), Phone = c("916-654-8585", 
"916-654-8794", "916-654-8589", "916-654-8255", "916-654-8535", 
"916-654-8575")), .Names = c("Site_Id", "name", "Phone"), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 1

Calum You
Calum You

Reputation: 15072

There might be a better way to do this, not too sure, but here is one solution with the tidyverse. The approach is to create site index columns for name and phone, which become new numbered column names using spread. Finally, we can use summarise_at to choose the first non-missing value from each column for each site. There should only be one based on how we set this up, producing the format desired. You can reorder the columns and rows however you like with select and arrange.

EDIT: changed to deal with many columns. This one basically wraps some functionality into spread_var, which has a table and a variable as input and creates first an index column with that variable in title case, and then spreads the values of that variable across the index columns. We make sure to use this function on a table with lower case column names and grouped by site id. The sym(), enquo, !! and := are needed to deal with dplyr's complicated evaluation, but at least it works as written. Might be possible to make it simpler, I'm not super practiced with quosures.

Calling the function is fairly simple, we get the variables to loop through from colnames and each loop spreads out another variable. Last line is same as before, collapse to the first non missing value. This has the advantage that it can deal with any number of variables per site_id and any number of variable columns.

EDIT2: Updated with an alternative tidyverse approach inspired by Prem, but which doesn't deal well with there being more than 3 values per site_id. However, it does all fit in one pipe, which is nice.

library(tidyverse)
tbl <- read_table2(
  "Site_Id    name    Phone
150        dan     916-654-8585
150        sasha     916-654-8794
150        Sam     916-654-8589
161        Jeff    916-654-8255
161        Danny    916-654-8535
120        Ali     916-654-8575"
)
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 2)
#> Warning: 1 parsing failure.
#> row # A tibble: 1 x 5 col     row col   expected actual        file         expected   <int> <chr> <chr>    <chr>         <chr>        actual 1     6 Phone ""       embedded null literal data file # A tibble: 1 x 5

spread_var <- function(df, var){
  varname <- rlang::sym(str_to_title(var))
  expr <- enquo(var)
  df %>%
    mutate(!!varname := row_number()) %>%
    spread(!!varname, !!expr, sep = "_")
}

out <- tbl %>%
  magrittr::set_colnames(str_to_lower(colnames(.))) %>%
  group_by(site_id)
for (i in colnames(out)[2:ncol(tbl)]){
  out <- spread_var(out, i)
}
out %>%
  summarise_at(vars(matches("_\\d$")), function(x) x[which(!is.na(x))[1]])
#> # A tibble: 3 x 7
#>   site_id Name_1 Name_2 Name_3 Phone_1      Phone_2      Phone_3     
#>     <int> <chr>  <chr>  <chr>  <chr>        <chr>        <chr>       
#> 1     120 Ali    <NA>   <NA>   916-654-8575 <NA>         <NA>        
#> 2     150 dan    sasha  Sam    916-654-8585 916-654-8589 916-654-8794
#> 3     161 Jeff   Danny  <NA>   916-654-8255 916-654-8535 <NA>

tbl %>%
  magrittr::set_colnames(str_to_lower(colnames(.))) %>%
  group_by(site_id) %>%
  summarise_all(~ str_c(., collapse = ",")) %>%
  imap_dfc(
    ~separate(
      data = tibble(.x),
      col = 1,
      into = c(str_c(.y, "_1"), str_c(.y, "_2"), str_c(.y, "_3")),
      sep = ",",
      fill = "right"
    )
  ) %>%
  select(-site_id_2, -site_id_3) %>%
  rename(site_id = site_id_1)
    #> # A tibble: 3 x 7
    #>   site_id Name_1 Name_2 Name_3 Phone_1      Phone_2      Phone_3     
    #>     <int> <chr>  <chr>  <chr>  <chr>        <chr>        <chr>       
    #> 1     120 Ali    <NA>   <NA>   916-654-8575 <NA>         <NA>        
    #> 2     150 dan    sasha  Sam    916-654-8585 916-654-8589 916-654-8794
    #> 3     161 Jeff   Danny  <NA>   916-654-8255 916-654-8535 <NA>

Created on 2018-03-07 by the reprex package (v0.2.0).

Upvotes: 1

Related Questions