Reputation: 113
I am working with a dataset where one variable has a number of variables within it. I need to parse through the text to extract each variable value in R.
Current:
|-------------|-------------|-----------------------------------------|
| Customer | Group | Extra |
|-------------|-------------|-----------------------------------------|
| 1 | A1 | {"Field1":"A","Field2":"B","Field3":"C"}|
|-------------|-------------|-----------------------------------------|
| 2 | A2 | {"Field1":"D","Field2":"E","Field3":"F"}|
|-------------|-------------|-----------------------------------------|
| 3 | A3 | {"Field1":"A","Field2":"G","Field3":"D"}|
|-------------|-------------|-----------------------------------------|
Desired:
|-------------|-------------|------------|-----------|-----------|
| Customer | Group | Field1 | Field2 | Field3 |
|-------------|-------------|------------|-----------|-----------|
| 1 | A1 | A | B | C |
|-------------|-------------|------------|-----------|-----------|
| 2 | A2 | D | E | F |
|-------------|-------------|------------|-----------|-----------|
| 3 | A3 | A | G | D |
|-------------|-------------|------------|-----------|-----------|
This seems like it should be easy, but I haven't been able to find a solution. Thank you for any help you can provide! I'm still new to R :)
Upvotes: 0
Views: 41
Reputation: 388862
We can perform some data cleaning, split the data on comma (,
) into separate rows and different columns on colon (:
) and get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(Extra = gsub('[{}]', '', Extra)) %>%
separate_rows(Extra, sep = ',') %>%
separate(Extra, c('col', 'value'), sep = ':') %>%
pivot_wider(names_from = col, values_from = value)
# Customer Group Field1 Field2 Field3
# <int> <chr> <chr> <chr> <chr>
#1 1 A1 A B C
#2 2 A2 D E F
#3 3 A3 A G D
data
df <- structure(list(Customer = 1:3, Group = c("A1", "A2", "A3"),
Extra = c("{Field1:A,Field2:B,Field3:C}",
"{Field1:D,Field2:E,Field3:F}", "{Field1:A,Field2:G,Field3:D}"
)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 1