iskandarblue
iskandarblue

Reputation: 7526

Best way to store varying arrays of strings in R dataframe

After importing a csv that had originally been a pandas string array, I am faced with the problem of converting it to a usable form in an R dataframe (sample data below). In R, the array is read in simply as a long string:

fubar
   row                                              text
55  55 ['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']
56  56                                ['a', 'similares']
57  57   ['parmacias', 'consulte', '_mrtates', 'o', 'a']
58  58                                 ['centro ', 'aa']
59  59          ['syº_mp', 'descuentoa', 'patente', 'f']

So for example:

> fubar$text[1]
[1] "['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']"
> class(fubar$text)
[1] "character"

I have managed to split the string into its component parts

 strsplit(gsub("\\[|\\]", "", fubar$text[1]),",")[[1]]
[1] "'armagoi'" " 'sa pae'" " 'al'"     " 'ta'"     " 'u'"      " 'be'"     " 'o'"

But this seems to have surrounded quotes by more quotes, which was unintended. What is the best way to remove these extra quotes, and then store each rows' strings in a list? Is this the best way to store string arrays in a dataframe cell? Is it as simple as looping over every row, converting to list and then reassigning it back to the dataframe, or is a list of lists a better practice to replace the current column? Any suggestions on how to rework this character column into a list column would be appreciated

as.list(strsplit(gsub("\\[|\\]", "", fubar$text[1]),",")[[1]])

Here is the data:

dput(fubar)
structure(list(row = c("55", "56", "57", "58", "59"), text = c("['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']", 
"['a', 'similares']", "['parmacias', 'consulte', '_mrtates', 'o', 'a']", 
"['centro ', 'aa']", "['syº_mp', 'descuentoa', 'patente', 'f']"
)), class = "data.frame", row.names = 55:59)

Upvotes: 0

Views: 723

Answers (3)

akrun
akrun

Reputation: 887511

We can use ex_square (from qdapRegex)

library(dplyr)
library(tidyr)
library(qdapRegex)
fubar %>%
   mutate(text = ex_square(text)) %>% 
   unnest(c(text)) %>% 
   separate_rows(text, sep=",\\s*")
#     row   text        
#   <chr> <chr>       
# 1 55    'armagoi'   
# 2 55    'sa pae'    
# 3 55    'al'        
# 4 55    'ta'        
# 5 55    'u'         
# 6 55    'be'        
# 7 55    'o'         
# 8 56    'a'         
# ...

Upvotes: 0

Vincent
Vincent

Reputation: 17785

A very similar answer was given, but I give code for both data.table and tidyverse. I consider two options: (1) words are stored in a list-column, where each row contains a vector of words, (2) a data frame with one word per row. This is basically a nest/unnesting pattern.

dat <- structure(list(row = c("55", "56", "57", "58", "59"), text = c("['armagoi', 'sa pae', 'al', 'ta', 'u', 'be', 'o']", 
"['a', 'similares']", "['parmacias', 'consulte', '_mrtates', 'o', 'a']", 
"['centro ', 'aa']", "['syº_mp', 'descuentoa', 'patente', 'f']"
)), class = "data.frame", row.names = 55:59)

data.table

library(data.table)
dat_dt <- data.table(dat)[
          , text := strsplit(gsub("\\[|\\]| |\\'", "", dat$text), ",")]

# one list of words per row
dat_dt[]
#>    row                            text
#> 1:  55    armagoi,sapae,al,ta,u,be,...
#> 2:  56                     a,similares
#> 3:  57 parmacias,consulte,_mrtates,o,a
#> 4:  58                       centro,aa
#> 5:  59     syº_mp,descuentoa,patente,f
dat_dt$text[[1]]
#> [1] "armagoi" "sapae"   "al"      "ta"      "u"       "be"      "o"

# one word per row (unnest)
head(dat_dt[, text[[1]], by=row])
#>    row      V1
#> 1:  55 armagoi
#> 2:  55   sapae
#> 3:  55      al
#> 4:  55      ta
#> 5:  55       u
#> 6:  55      be

tidyverse

library(dplyr)
library(tidyr)

dat_tb <- dat %>%
          mutate(text := strsplit(gsub("\\[|\\]| |\\'", "", dat$text), ","))
dat_tb
#>   row                                text
#> 1  55    armagoi, sapae, al, ta, u, be, o
#> 2  56                        a, similares
#> 3  57 parmacias, consulte, _mrtates, o, a
#> 4  58                          centro, aa
#> 5  59      syº_mp, descuentoa, patente, f

dat_tb %>% unnest(text) %>% head
#> # A tibble: 6 x 2
#>   row   text   
#>   <chr> <chr>  
#> 1 55    armagoi
#> 2 55    sapae  
#> 3 55    al     
#> 4 55    ta     
#> 5 55    u      
#> 6 55    be

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389135

Well, "best" is subjective but here are 3 ways in which you can store the data. You can use one of them based on your requirement further.

  1. Comma separated string :

Remove any additional characters which are not required using gsub.

library(dplyr)
library(tidyr)

fubar %>% mutate(text = gsub("\\[|'|\\]", "", text))

#  row                                text
#1  55   armagoi, sa pae, al, ta, u, be, o
#2  56                        a, similares
#3  57 parmacias, consulte, _mrtates, o, a
#4  58                         centro , aa
#5  59      syº_mp, descuentoa, patente, f
  1. Long format (tidy)

Split strings on comma and treat every word as a new row.

fubar %>%
  mutate(text = gsub("\\[|'|\\]", "", text)) %>%
  separate_rows(text , sep = ',\\s*')

#  row   text        
#   <chr> <chr>       
# 1 55    "armagoi"   
# 2 55    "sa pae"    
# 3 55    "al"        
# 4 55    "ta"        
# 5 55    "u"         
# 6 55    "be" 
#...
#...       
  1. In a list.
fubar %>%
  mutate(text = gsub("\\[|'|\\]", "", text)) %>%
  separate_rows(text , sep = ',\\s*') %>%
  group_by(row) %>%
  summarise(text = list(text))

#  row   text     
#  <chr> <list>   
#1 55    <chr [7]>
#2 56    <chr [2]>
#3 57    <chr [5]>
#4 58    <chr [2]>
#5 59    <chr [4]>

Personally, I would prefer option 2 because it is flexible.

Upvotes: 3

Related Questions