Reputation: 183
I'm working on a data cleaning problem where I'm stuck. I've started receiving csv-files in the format shown below and I need to clean it before I can do any analysis. There are several such columns and there can be a couple hundred variables in each cell that need to be extracted.
Original <- structure(list(CustNum = c(0, 1),
Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
Identifier = c("A", "B")),
row.names = c(NA, -2L),
class = c("tbl_df", "tbl", "data.frame"))
What I'm trying to do is wrangle it into this format.
Desired <- tibble(CustNum = c(0, 0, 0, 0, 1, 1, 1, 1),
Sales = c(1000, 345, "Zero", 56, 987, 879, 325, 4568),
Amounts = c(10, 2, 0, 98, 57, 25, 52, 75),
Number = c(1, 2, 3, 4, 4, 3, 2, 1),
Identifier = c("A", "A", "A", "A", "B", "B", "B", "B"))
I've tried a number of different variations of the following type but can't get anywhere.
Original$Sales %>%
str_replace("\\[", "") %>%
str_replace("\\]", "") %>%
str_replace("'", "")
It's easy to do the cleaning in Power Query in Excel but would like to find a way to do it in R so I don't have to use several different tools. Can anyone show me how to do this?
Upvotes: 1
Views: 373
Reputation: 1364
You can try this approach
library(tidyverse)
library(stringr)
Original2 <- Original %>%
mutate_at(vars(Sales, Amounts, Number), ~str_replace_all(., "\\[|\\'|\\]|\\s", "")) %>%
separate_rows(c("Sales", "Amounts", "Number"), sep = ",")
# CustNum Sales Amounts Number Identifier
# <dbl> <chr> <chr> <chr> <chr>
# 1 0 1000 10 1 A
# 2 0 345 2 2 A
# 3 0 Zero 0 3 A
# 4 0 56 98 4 A
# 5 1 987 57 4 B
# 6 1 879 25 3 B
# 7 1 325 52 2 B
# 8 1 4568 75 1 B
Here we replace [ , ' and space, then we use separate_rows()
from tidyr
package to separate rows. It takes 2 steps to achieve our goal.
Upvotes: 2
Reputation: 7818
Try with this:
library(dplyr) # must be version >= 1.0.0
library(stringr)
Original %>%
mutate(across(everything(), str_remove_all, pattern = "\\[|\\]|\\'")) %>%
mutate(across(everything(), str_split, pattern = ",")) %>%
tidyr::unnest(everything()) %>%
mutate(across(everything(), str_trim)) %>%
mutate(across(c(CustNum, Amounts, Number), as.numeric))
# A tibble: 8 x 5
CustNum Sales Amounts Number Identifier
<dbl> <chr> <dbl> <dbl> <chr>
1 0 1000 10 1 A
2 0 345 2 2 A
3 0 Zero 0 3 A
4 0 56 98 4 A
5 1 987 57 4 B
6 1 879 25 3 B
7 1 325 52 2 B
8 1 4568 75 1 B
Basically:
[
]
'
,
Upvotes: 2
Reputation: 39595
I would suggest this approach reshaping your Original
data first to long and then separate the rows by sep=','
. After that you will clean the variable to remove some special characters. Therefore, you can create an id variable by group in order to transform data to wide as you want in Desired
:
library(tidyverse)
#Reshape
Original %>%
pivot_longer(cols = -c(CustNum,Identifier)) %>%
separate_rows(value,sep = ',') %>%
mutate(value=trimws(gsub("[[:punct:]]", " ", value))) %>%
group_by(name) %>% mutate(id=1:n()) %>%
pivot_wider(names_from = name,values_from=value) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 8 x 5
CustNum Identifier Sales Amounts Number
<dbl> <chr> <chr> <chr> <chr>
1 0 A 1000 10 1
2 0 A 345 2 2
3 0 A Zero 0 3
4 0 A 56 98 4
5 1 B 987 57 4
6 1 B 879 25 3
7 1 B 325 52 2
8 1 B 4568 75 1
Upvotes: 1