Reputation: 12749
My data looks like the following:
ids <- c(1,2,3)
titles <- c("Entry1", "Entry2", "Entry3")
tags <- c("<self-help><motivation>", "<programming><r><data.frame>", "<photography>")
df <- data.frame(id = ids, title = titles, tags = tags)
df
Output:
id title tags
1 1 Entry1 <self-help><motivation>
2 2 Entry2 <programming><r><data.frame>
3 3 Entry3 <photography>
I'm struggling to extract out the tags into a new data frame. What I want is a second data frame that looks like this:
id tag
1 1 self-help
2 1 motivation
3 2 programming
4 2 r
5 2 data.frame
6 3 photography
I've managed to do something with lists using the following, but the resulting column seems to be nested somehow. My attempts to unlist()
either have no effect or produce too many entries in the resulting vector.
df$tags_list <- lapply(df$tags, function(x)strsplit(gsub("^.|.$", "", x), "><")[[1]])
My question is: can someone please help me generate this new data.frame
(or data.table
) given df
as shown here? Any words of wisdom on how to deal with nested lists in general are greatly appreciated as well.
Upvotes: 2
Views: 92
Reputation: 28441
With base R
version 3.2.0+
you could try:
lst <- strsplit(gsub("^<|>$", "", df$tags), split="><")
data.frame(id=rep(df$id, lengths(lst)), tags_list=unlist(lst))
# id tags_list
# 1 1 self-help
# 2 1 motivation
# 3 2 programming
# 4 2 r
# 5 2 data.frame
# 6 3 photography
With gsub("^<|>$"..)
we eliminate beginning and terminal brackets leaving the pattern "><"
as word separators. Then we create the data frame with the ids and the separated list.
Update
The function lengths
was added in R 3.2.0 and does for lists what length
similarly does for vectors. Before it was required to use sapply(lst, length)
.
Upvotes: 9
Reputation: 887183
Another option using stri_extract_all
from library(stringi)
with melt
library(stringi)
library(reshape2)
melt(setNames(stri_extract_all_regex(df$tags, '[^<>]+'), df$id))
Upvotes: 1
Reputation: 31171
You can use cSplit
from splitstackshape
:
library(splitstackshape)
cSplit(transform(df, tags=gsub('^<|>$','', tags)), 'tags', sep='><', direction='long')
# id title tags
#1: 1 Entry1 self-help
#2: 1 Entry1 motivation
#3: 2 Entry2 programming
#4: 2 Entry2 r
#5: 2 Entry2 data.frame
#6: 3 Entry3 photography
Upvotes: 3
Reputation: 20463
tidyr::unnest
to the rescue...
library(tidyr)
library(dplyr) # for %>%
df$tags2 <- strsplit(gsub("^.|.$", "", df$tags), "><")
df %>%
unnest(tags2)
# id title tags tags2
# 1 1 Entry1 <self-help><motivation> self-help
# 2 1 Entry1 <self-help><motivation> motivation
# 3 2 Entry2 <programming><r><data.frame> programming
# 4 2 Entry2 <programming><r><data.frame> r
# 5 2 Entry2 <programming><r><data.frame> data.frame
# 6 3 Entry3 <photography> photography
Upvotes: 4