Sam Firke
Sam Firke

Reputation: 23004

Splitting string into unknown number of new dataframe columns

I have a dataframe with a character column that contains email metadata in the form of multiple strings delimited by a newline character \n:

  person                                                                                                                                                 myString
1   John                                                                                                            To [email protected] by sender6 on 01-12-2014\n
2   Jane To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n
3    Tim                                                                To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n

I want to split the different substrings of myString into different columns so that it looks like this:

  person                                                     email1                                      email2                                        email3
1   John                To [email protected] by sender6 on 01-12-2014                                        <NA>                                          <NA>
2   Jane To [email protected],[email protected] by sender1 on 01-22-2014 To [email protected] by sender2 on 02-03-2014 To [email protected] by sender1 on 06-21-2014
3    Tim                To [email protected] by sender2 on 05-11-2014  To [email protected] by sender2 on 06-03-2015                                          <NA>

My current approach uses separate from the tidyr package:

library(dplyr)
library(tidyr)
res1 <- df %>% 
    separate(col = myString, into = paste(rep("email", 3), 1:3), sep = "\\n", extra = "drop")
res1[res1 == ""] <- NA

But with this approach, I have to manually specify that there are three columns to extract.

I'm looking to improve this process with either or both of:

  1. A way to automate that calculation of the maximum number of occurrences of the delimiting character (i.e., how many new variables are needed)
  2. Other approaches to splitting into an unknown number of columns

And if there's a good solution that returns data in a long form, instead of wide, that would be great too.

Sample data:

df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To [email protected] by sender6 on 01-12-2014\n", 
    "To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n", 
    "To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n"
    )), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 1

Views: 2478

Answers (4)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

I would suggest cSplit from my "splitstackshape" package:

library(splitstackshape)
cSplit(df, "myString", "\n")
#    person                                                 myString_1
# 1:   John                To [email protected] by sender6 on 01-12-2014
# 2:   Jane To [email protected],[email protected] by sender1 on 01-22-2014
# 3:    Tim                To [email protected] by sender2 on 05-11-2014
#                                     myString_2
# 1:                                          NA
# 2: To [email protected] by sender2 on 02-03-2014
# 3:  To [email protected] by sender2 on 06-03-2015
#                                       myString_3
# 1:                                            NA
# 2: To [email protected] by sender1 on 06-21-2014
# 3:                                            NA

You can also try stri_split_fixed from the "stringi" package with the argument simplify = TRUE (though with your sample data, this adds an extra empty column at the end). The approach would be something like:

library(stringi)
data.frame(person = df$person, 
           stri_split_fixed(df$myString, "\n", 
                            simplify = TRUE))

Upvotes: 4

eddi
eddi

Reputation: 49448

This might suffice:

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

dt[, strsplit(myString, split = "\n"), by = person]
#   person                                                         V1
#1:   John                To [email protected] by sender6 on 01-12-2014
#2:   Jane To [email protected],[email protected] by sender1 on 01-22-2014
#3:   Jane                To [email protected] by sender2 on 02-03-2014
#4:   Jane              To [email protected] by sender1 on 06-21-2014
#5:    Tim                To [email protected] by sender2 on 05-11-2014
#6:    Tim                 To [email protected] by sender2 on 06-03-2015

And then can trivially convert to wide format:

dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
      person ~ idx, value.var = 'V1')
#   person                                                          1                                           2                                             3
#1:   Jane To [email protected],[email protected] by sender1 on 01-22-2014 To [email protected] by sender2 on 02-03-2014 To [email protected] by sender1 on 06-21-2014
#2:   John                To [email protected] by sender6 on 01-12-2014                                          NA                                            NA
#3:    Tim                To [email protected] by sender2 on 05-11-2014  To [email protected] by sender2 on 06-03-2015                                            NA

# (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)

Upvotes: 1

Michael Lawrence
Michael Lawrence

Reputation: 1021

Here is an efficient route to a long form:

a <- strsplit(df$myString, "\n")
lens <- vapply(a, length, integer(1L)) # or lengths(a) in R 3.2
longdf <- df[rep(seq_along(a), lens),]
longdf$string <- unlist(a)

Note that stack() is often useful for these cases.

Can be simplified by using the IRanges Bioconductor package:

longdf <- df[togroup(a),]
longdf$string <- unlist(a)

Then, if really necessary, go to wide-form:

longdf$myString <- NULL
longdf$token <- sequence(lens)
widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")

Upvotes: 1

cory
cory

Reputation: 6659

Seems hacky, but here ya go...

Use strsplit to split the char vector. Get the max length, use that for your columns.

df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("To [email protected] by sender6 on 01-12-2014\n",
               "To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n",
               "To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n"
  ), stringsAsFactors=FALSE
)

a <- strsplit(df$myString, "\n")
max_len <- max(sapply(a, length))
for(i in 1:max_len){
  df[,paste0("email", i)] <- sapply(a, "[", i)
}

Upvotes: 1

Related Questions