Reputation: 23004
I have a dataframe with a character column that contains email metadata in the form of multiple strings delimited by a newline character \n
:
person myString
1 John To [email protected] by sender6 on 01-12-2014\n
2 Jane To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n
3 Tim To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n
I want to split the different substrings of myString into different columns so that it looks like this:
person email1 email2 email3
1 John To [email protected] by sender6 on 01-12-2014 <NA> <NA>
2 Jane To [email protected],[email protected] by sender1 on 01-22-2014 To [email protected] by sender2 on 02-03-2014 To [email protected] by sender1 on 06-21-2014
3 Tim To [email protected] by sender2 on 05-11-2014 To [email protected] by sender2 on 06-03-2015 <NA>
My current approach uses separate
from the tidyr package:
library(dplyr)
library(tidyr)
res1 <- df %>%
separate(col = myString, into = paste(rep("email", 3), 1:3), sep = "\\n", extra = "drop")
res1[res1 == ""] <- NA
But with this approach, I have to manually specify that there are three columns to extract.
I'm looking to improve this process with either or both of:
And if there's a good solution that returns data in a long form, instead of wide, that would be great too.
Sample data:
df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To [email protected] by sender6 on 01-12-2014\n",
"To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n",
"To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n"
)), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 1
Views: 2478
Reputation: 193517
I would suggest cSplit
from my "splitstackshape" package:
library(splitstackshape)
cSplit(df, "myString", "\n")
# person myString_1
# 1: John To [email protected] by sender6 on 01-12-2014
# 2: Jane To [email protected],[email protected] by sender1 on 01-22-2014
# 3: Tim To [email protected] by sender2 on 05-11-2014
# myString_2
# 1: NA
# 2: To [email protected] by sender2 on 02-03-2014
# 3: To [email protected] by sender2 on 06-03-2015
# myString_3
# 1: NA
# 2: To [email protected] by sender1 on 06-21-2014
# 3: NA
You can also try stri_split_fixed
from the "stringi" package with the argument simplify = TRUE
(though with your sample data, this adds an extra empty column at the end). The approach would be something like:
library(stringi)
data.frame(person = df$person,
stri_split_fixed(df$myString, "\n",
simplify = TRUE))
Upvotes: 4
Reputation: 49448
This might suffice:
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, strsplit(myString, split = "\n"), by = person]
# person V1
#1: John To [email protected] by sender6 on 01-12-2014
#2: Jane To [email protected],[email protected] by sender1 on 01-22-2014
#3: Jane To [email protected] by sender2 on 02-03-2014
#4: Jane To [email protected] by sender1 on 06-21-2014
#5: Tim To [email protected] by sender2 on 05-11-2014
#6: Tim To [email protected] by sender2 on 06-03-2015
And then can trivially convert to wide format:
dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
person ~ idx, value.var = 'V1')
# person 1 2 3
#1: Jane To [email protected],[email protected] by sender1 on 01-22-2014 To [email protected] by sender2 on 02-03-2014 To [email protected] by sender1 on 06-21-2014
#2: John To [email protected] by sender6 on 01-12-2014 NA NA
#3: Tim To [email protected] by sender2 on 05-11-2014 To [email protected] by sender2 on 06-03-2015 NA
# (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)
Upvotes: 1
Reputation: 1021
Here is an efficient route to a long form:
a <- strsplit(df$myString, "\n")
lens <- vapply(a, length, integer(1L)) # or lengths(a) in R 3.2
longdf <- df[rep(seq_along(a), lens),]
longdf$string <- unlist(a)
Note that stack()
is often useful for these cases.
Can be simplified by using the IRanges Bioconductor package:
longdf <- df[togroup(a),]
longdf$string <- unlist(a)
Then, if really necessary, go to wide-form:
longdf$myString <- NULL
longdf$token <- sequence(lens)
widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")
Upvotes: 1
Reputation: 6659
Seems hacky, but here ya go...
Use strsplit to split the char vector. Get the max length, use that for your columns.
df <- data.frame(
person = c("John", "Jane", "Tim"),
myString = c("To [email protected] by sender6 on 01-12-2014\n",
"To [email protected],[email protected] by sender1 on 01-22-2014\nTo [email protected] by sender2 on 02-03-2014\nTo [email protected] by sender1 on 06-21-2014\n",
"To [email protected] by sender2 on 05-11-2014\nTo [email protected] by sender2 on 06-03-2015\n"
), stringsAsFactors=FALSE
)
a <- strsplit(df$myString, "\n")
max_len <- max(sapply(a, length))
for(i in 1:max_len){
df[,paste0("email", i)] <- sapply(a, "[", i)
}
Upvotes: 1