Economist
Economist

Reputation: 183

Extracting many variables from a single column in R

I'm working on a data cleaning problem where I'm stuck. I've started receiving csv-files in the format shown below and I need to clean it before I can do any analysis. There are several such columns and there can be a couple hundred variables in each cell that need to be extracted.

Original <- structure(list(CustNum = c(0, 1),
                      Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
                      Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
                      Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
                      Identifier = c("A", "B")),
                      row.names = c(NA, -2L),
                      class = c("tbl_df", "tbl", "data.frame"))

What I'm trying to do is wrangle it into this format.

Desired <- tibble(CustNum = c(0, 0, 0, 0, 1, 1, 1, 1),
                  Sales = c(1000, 345, "Zero", 56, 987, 879, 325, 4568),
                  Amounts = c(10, 2, 0, 98, 57, 25, 52, 75),
                  Number = c(1, 2, 3, 4, 4, 3, 2, 1),
                  Identifier = c("A", "A", "A", "A", "B", "B", "B", "B"))

I've tried a number of different variations of the following type but can't get anywhere.

Original$Sales %>%
  str_replace("\\[", "") %>%
  str_replace("\\]", "") %>%
  str_replace("'", "")

It's easy to do the cleaning in Power Query in Excel but would like to find a way to do it in R so I don't have to use several different tools. Can anyone show me how to do this?

Upvotes: 1

Views: 373

Answers (3)

Mike V
Mike V

Reputation: 1364

You can try this approach

library(tidyverse)
library(stringr)
Original2 <- Original %>% 
  mutate_at(vars(Sales, Amounts, Number), ~str_replace_all(., "\\[|\\'|\\]|\\s", "")) %>% 
  separate_rows(c("Sales", "Amounts", "Number"), sep = ",")

#   CustNum Sales Amounts Number Identifier
#     <dbl> <chr> <chr>   <chr>  <chr>     
# 1       0 1000  10      1      A         
# 2       0 345   2       2      A         
# 3       0 Zero  0       3      A         
# 4       0 56    98      4      A         
# 5       1 987   57      4      B         
# 6       1 879   25      3      B         
# 7       1 325   52      2      B         
# 8       1 4568  75      1      B  

Here we replace [ , ' and space, then we use separate_rows() from tidyr package to separate rows. It takes 2 steps to achieve our goal.

Upvotes: 2

Edo
Edo

Reputation: 7818

Try with this:

library(dplyr) # must be version >= 1.0.0
library(stringr)

Original %>%
 mutate(across(everything(), str_remove_all, pattern = "\\[|\\]|\\'")) %>% 
 mutate(across(everything(), str_split, pattern = ",")) %>%                
 tidyr::unnest(everything()) %>% 
 mutate(across(everything(), str_trim)) %>% 
 mutate(across(c(CustNum, Amounts, Number), as.numeric))
# A tibble: 8 x 5
  CustNum Sales Amounts Number Identifier
    <dbl> <chr>   <dbl>  <dbl> <chr>     
1       0 1000       10      1 A         
2       0 345         2      2 A         
3       0 Zero        0      3 A         
4       0 56         98      4 A         
5       1 987        57      4 B         
6       1 879        25      3 B         
7       1 325        52      2 B         
8       1 4568       75      1 B   

Basically:

  • Remove [ ] '
  • Split by ,
  • Unnest the lists
  • Trim out unnecessary spaces
  • Set to numeric where necessary

Upvotes: 2

Duck
Duck

Reputation: 39595

I would suggest this approach reshaping your Original data first to long and then separate the rows by sep=','. After that you will clean the variable to remove some special characters. Therefore, you can create an id variable by group in order to transform data to wide as you want in Desired:

library(tidyverse)
#Reshape
Original %>%
  pivot_longer(cols = -c(CustNum,Identifier)) %>%
  separate_rows(value,sep = ',') %>%
  mutate(value=trimws(gsub("[[:punct:]]", " ", value))) %>%
  group_by(name) %>% mutate(id=1:n()) %>%
  pivot_wider(names_from = name,values_from=value) %>% 
  ungroup() %>%
  select(-id)

Output:

# A tibble: 8 x 5
  CustNum Identifier Sales Amounts Number
    <dbl> <chr>      <chr> <chr>   <chr> 
1       0 A          1000  10      1     
2       0 A          345   2       2     
3       0 A          Zero  0       3     
4       0 A          56    98      4     
5       1 B          987   57      4     
6       1 B          879   25      3     
7       1 B          325   52      2     
8       1 B          4568  75      1  

Upvotes: 1

Related Questions