anba
anba

Reputation: 583

R - splitting column with different length of rows

I've got data.frame like the one below

ID  age location
1   83  country=X;province=A;city=J
2   15  country=X;city=K
3   2   country=Y;province=B;city=I
4   12  country=X;city=L
5   2   country=Y;city=J
6   2   country=Y;province=A;city=M
7   18  country=X;province=B;city=J
8   85  country=X;province=A;city=I

To describe it: there is third column (location) in which records contain strings separated by ";" and have different lengths.

As a result, I need to obtain the data.frame as such:

ID  age country city
1   83  X       J
2   15  X       K
3   2   Y       I
4   12  X       L
5   2   Y       J
6   2   Y       M
7   18  X       J
8   85  X       I

To describe - I need to separate the column and choose only records about country and city (without province). separate from dplyr allow only to divide by ";" but there is different number of ";" in rows. What should I do?

Upvotes: 3

Views: 913

Answers (4)

tmfmnk
tmfmnk

Reputation: 39858

A different tidyverse possibility could be:

df %>%
 separate(location, c("country", "city"), sep = ";") %>%
 mutate_at(3:4, funs(sub(".*=", "", .)))

  ID age country city
1  1  83       X    A
2  2  15       X    K
3  3   2       Y    B
4  4  12       X    L
5  5   2       Y    J
6  6   2       Y    A
7  7  18       X    B
8  8  85       X    A

In the first step it separates "location" into "country" and "city" based on ;. Then, it extract the elements after = from the newly created columns.

Upvotes: 1

utubun
utubun

Reputation: 4520

You can do it with tidyr::extract function:

library(tidyverse)

extract(
  data  = dat,
  col   = location,
  into  = c('country', 'city'),
  regex = "^country=([[:alpha:]]+).*city=([[:alpha:]]+)$"
  )

  ID age country city
1  1  83       X    J
2  2  15       X    K
3  3   2       Y    I
4  4  12       X    L
5  5   2       Y    J
6  6   2       Y    M
7  7  18       X    J
8  8  85       X    I

Data

dat <- read.table(
  text = "ID  age location
1   83  country=X;province=A;city=J
2   15  country=X;city=K
3   2   country=Y;province=B;city=I
4   12  country=X;city=L
5   2   country=Y;city=J
6   2   country=Y;province=A;city=M
7   18  country=X;province=B;city=J
8   85  country=X;province=A;city=I",
  header = T, stringsAsFactors = F
)

Upvotes: 2

mr.joshuagordon
mr.joshuagordon

Reputation: 764

df <- read.table(text='ID  age location
                 1   83  country=X;province=A;city=J
                 2   15  country=X;city=K
                 3   2   country=Y;province=B;city=I
                 4   12  country=X;city=L
                 5   2   country=Y;city=J
                 6   2   country=Y;province=A;city=M
                 7   18  country=X;province=B;city=J
                 8   85  country=X;province=A;city=I
                 ', header= T)




my_fun <- function(x){
  y <- as.data.frame(strsplit(x,'='))
  names(y) <- as.character(unlist(y[1,]))
  y <- y[-1,]
}

f <- strsplit(as.character(df$location), ";")
s <- data.table::rbindlist(lapply(f,my_fun), fill=T)

df$location <- NULL
df <- cbind(df, s)
df
  ID age country province city
1  1  83       X        A    J
2  2  15       X     <NA>    K
3  3   2       Y        B    I
4  4  12       X     <NA>    L
5  5   2       Y     <NA>    J
6  6   2       Y        A    M
7  7  18       X        B    J
8  8  85       X        A    I

Upvotes: 1

Calum You
Calum You

Reputation: 15062

You can do this with separate_rows and some subsequent reshaping. separate_rows puts things into long format, so we can do some filtering and then spread to get the desired result:

library(tidyverse)
tbl <- read_table2(
  "ID  age location
1   83  country=X;province=A;city=J
2   15  country=X;city=K
3   2   country=Y;province=B;city=I
4   12  country=X;city=L
5   2   country=Y;city=J
6   2   country=Y;province=A;city=M
7   18  country=X;province=B;city=J
8   85  country=X;province=A;city=I"
)

tbl %>%
  separate_rows(location, sep = ";") %>%
  separate(location, c("location_type", "value")) %>%
  filter(location_type %in% c("country", "city")) %>%
  spread(location_type, value)
#> # A tibble: 8 x 4
#>      ID   age city  country
#>   <dbl> <dbl> <chr> <chr>  
#> 1     1    83 J     X      
#> 2     2    15 K     X      
#> 3     3     2 I     Y      
#> 4     4    12 L     X      
#> 5     5     2 J     Y      
#> 6     6     2 M     Y      
#> 7     7    18 J     X      
#> 8     8    85 I     X

Because you only have two cases to deal with, you might find it faster or easier to just use regex to directly extract the values of interest:

tbl %>%
  mutate(
    country = str_extract(location, "(?<=country\\=)."),
    city = str_extract(location, "(?<=city\\=).")
  )
#> # A tibble: 8 x 5
#>      ID   age location                    country city 
#>   <dbl> <dbl> <chr>                       <chr>   <chr>
#> 1     1    83 country=X;province=A;city=J X       J    
#> 2     2    15 country=X;city=K            X       K    
#> 3     3     2 country=Y;province=B;city=I Y       I    
#> 4     4    12 country=X;city=L            X       L    
#> 5     5     2 country=Y;city=J            Y       J    
#> 6     6     2 country=Y;province=A;city=M Y       M    
#> 7     7    18 country=X;province=B;city=J X       J    
#> 8     8    85 country=X;province=A;city=I X       I

Created on 2019-02-25 by the reprex package (v0.2.1)

Upvotes: 2

Related Questions