Reputation: 7526
After importing a csv that had originally been a pandas string array, I am faced with the problem of converting it to a usable form in an R dataframe (sample data below). In R, the array is read in simply as a long string:
fubar
row text
55 55 ['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']
56 56 ['a', 'similares']
57 57 ['parmacias', 'consulte', '_mrtates', 'o', 'a']
58 58 ['centro ', 'aa']
59 59 ['syº_mp', 'descuentoa', 'patente', 'f']
So for example:
> fubar$text[1]
[1] "['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']"
> class(fubar$text)
[1] "character"
I have managed to split the string into its component parts
strsplit(gsub("\\[|\\]", "", fubar$text[1]),",")[[1]]
[1] "'armagoi'" " 'sa pae'" " 'al'" " 'ta'" " 'u'" " 'be'" " 'o'"
But this seems to have surrounded quotes by more quotes, which was unintended. What is the best way to remove these extra quotes, and then store each rows' strings in a list? Is this the best way to store string arrays in a dataframe cell? Is it as simple as looping over every row, converting to list and then reassigning it back to the dataframe, or is a list of lists a better practice to replace the current column? Any suggestions on how to rework this character column into a list column would be appreciated
as.list(strsplit(gsub("\\[|\\]", "", fubar$text[1]),",")[[1]])
Here is the data:
dput(fubar)
structure(list(row = c("55", "56", "57", "58", "59"), text = c("['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']",
"['a', 'similares']", "['parmacias', 'consulte', '_mrtates', 'o', 'a']",
"['centro ', 'aa']", "['syº_mp', 'descuentoa', 'patente', 'f']"
)), class = "data.frame", row.names = 55:59)
Upvotes: 0
Views: 723
Reputation: 887511
We can use ex_square
(from qdapRegex
)
library(dplyr)
library(tidyr)
library(qdapRegex)
fubar %>%
mutate(text = ex_square(text)) %>%
unnest(c(text)) %>%
separate_rows(text, sep=",\\s*")
# row text
# <chr> <chr>
# 1 55 'armagoi'
# 2 55 'sa pae'
# 3 55 'al'
# 4 55 'ta'
# 5 55 'u'
# 6 55 'be'
# 7 55 'o'
# 8 56 'a'
# ...
Upvotes: 0
Reputation: 17785
A very similar answer was given, but I give code for both data.table
and tidyverse
. I consider two options: (1) words are stored in a list-column, where each row contains a vector of words, (2) a data frame with one word per row. This is basically a nest/unnesting pattern.
dat <- structure(list(row = c("55", "56", "57", "58", "59"), text = c("['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']",
"['a', 'similares']", "['parmacias', 'consulte', '_mrtates', 'o', 'a']",
"['centro ', 'aa']", "['syº_mp', 'descuentoa', 'patente', 'f']"
)), class = "data.frame", row.names = 55:59)
library(data.table)
dat_dt <- data.table(dat)[
, text := strsplit(gsub("\\[|\\]| |\\'", "", dat$text), ",")]
# one list of words per row
dat_dt[]
#> row text
#> 1: 55 armagoi,sapae,al,ta,u,be,...
#> 2: 56 a,similares
#> 3: 57 parmacias,consulte,_mrtates,o,a
#> 4: 58 centro,aa
#> 5: 59 syº_mp,descuentoa,patente,f
dat_dt$text[[1]]
#> [1] "armagoi" "sapae" "al" "ta" "u" "be" "o"
# one word per row (unnest)
head(dat_dt[, text[[1]], by=row])
#> row V1
#> 1: 55 armagoi
#> 2: 55 sapae
#> 3: 55 al
#> 4: 55 ta
#> 5: 55 u
#> 6: 55 be
library(dplyr)
library(tidyr)
dat_tb <- dat %>%
mutate(text := strsplit(gsub("\\[|\\]| |\\'", "", dat$text), ","))
dat_tb
#> row text
#> 1 55 armagoi, sapae, al, ta, u, be, o
#> 2 56 a, similares
#> 3 57 parmacias, consulte, _mrtates, o, a
#> 4 58 centro, aa
#> 5 59 syº_mp, descuentoa, patente, f
dat_tb %>% unnest(text) %>% head
#> # A tibble: 6 x 2
#> row text
#> <chr> <chr>
#> 1 55 armagoi
#> 2 55 sapae
#> 3 55 al
#> 4 55 ta
#> 5 55 u
#> 6 55 be
Upvotes: 1
Reputation: 389135
Well, "best" is subjective but here are 3 ways in which you can store the data. You can use one of them based on your requirement further.
Remove any additional characters which are not required using gsub
.
library(dplyr)
library(tidyr)
fubar %>% mutate(text = gsub("\\[|'|\\]", "", text))
# row text
#1 55 armagoi, sa pae, al, ta, u, be, o
#2 56 a, similares
#3 57 parmacias, consulte, _mrtates, o, a
#4 58 centro , aa
#5 59 syº_mp, descuentoa, patente, f
Split strings on comma and treat every word as a new row.
fubar %>%
mutate(text = gsub("\\[|'|\\]", "", text)) %>%
separate_rows(text , sep = ',\\s*')
# row text
# <chr> <chr>
# 1 55 "armagoi"
# 2 55 "sa pae"
# 3 55 "al"
# 4 55 "ta"
# 5 55 "u"
# 6 55 "be"
#...
#...
fubar %>%
mutate(text = gsub("\\[|'|\\]", "", text)) %>%
separate_rows(text , sep = ',\\s*') %>%
group_by(row) %>%
summarise(text = list(text))
# row text
# <chr> <list>
#1 55 <chr [7]>
#2 56 <chr [2]>
#3 57 <chr [5]>
#4 58 <chr [2]>
#5 59 <chr [4]>
Personally, I would prefer option 2 because it is flexible.
Upvotes: 3