dwurf
dwurf

Reputation: 12749

Extracting multiple values from a character field into a new data table

My data looks like the following:

ids <- c(1,2,3)
titles <- c("Entry1", "Entry2", "Entry3")
tags <- c("<self-help><motivation>", "<programming><r><data.frame>", "<photography>")
df <- data.frame(id = ids, title = titles, tags = tags)
df

Output:

   id  title                         tags
1   1 Entry1      <self-help><motivation>
2   2 Entry2 <programming><r><data.frame>
3   3 Entry3                <photography>

I'm struggling to extract out the tags into a new data frame. What I want is a second data frame that looks like this:

  id         tag
1  1   self-help
2  1  motivation
3  2 programming
4  2           r
5  2  data.frame
6  3 photography

I've managed to do something with lists using the following, but the resulting column seems to be nested somehow. My attempts to unlist() either have no effect or produce too many entries in the resulting vector.

df$tags_list <- lapply(df$tags, function(x)strsplit(gsub("^.|.$", "", x), "><")[[1]])

My question is: can someone please help me generate this new data.frame (or data.table) given df as shown here? Any words of wisdom on how to deal with nested lists in general are greatly appreciated as well.

Upvotes: 2

Views: 92

Answers (4)

Pierre L
Pierre L

Reputation: 28441

With base R version 3.2.0+ you could try:

lst <- strsplit(gsub("^<|>$", "", df$tags), split="><")
data.frame(id=rep(df$id, lengths(lst)), tags_list=unlist(lst))
#   id   tags_list
# 1  1   self-help
# 2  1  motivation
# 3  2 programming
# 4  2           r
# 5  2  data.frame
# 6  3 photography

With gsub("^<|>$"..) we eliminate beginning and terminal brackets leaving the pattern "><" as word separators. Then we create the data frame with the ids and the separated list.

Update

The function lengths was added in R 3.2.0 and does for lists what length similarly does for vectors. Before it was required to use sapply(lst, length).

Upvotes: 9

akrun
akrun

Reputation: 887183

Another option using stri_extract_all from library(stringi) with melt

library(stringi)
library(reshape2)
melt(setNames(stri_extract_all_regex(df$tags, '[^<>]+'), df$id))

Upvotes: 1

Colonel Beauvel
Colonel Beauvel

Reputation: 31171

You can use cSplit from splitstackshape:

library(splitstackshape)

cSplit(transform(df, tags=gsub('^<|>$','', tags)), 'tags', sep='><', direction='long')
#   id  title        tags
#1:  1 Entry1   self-help
#2:  1 Entry1  motivation
#3:  2 Entry2 programming
#4:  2 Entry2           r
#5:  2 Entry2  data.frame
#6:  3 Entry3 photography

Upvotes: 3

JasonAizkalns
JasonAizkalns

Reputation: 20463

tidyr::unnest to the rescue...

library(tidyr)
library(dplyr) # for %>%

df$tags2 <- strsplit(gsub("^.|.$", "", df$tags), "><")

df %>%
  unnest(tags2)

#   id  title                         tags       tags2
# 1  1 Entry1      <self-help><motivation>   self-help
# 2  1 Entry1      <self-help><motivation>  motivation
# 3  2 Entry2 <programming><r><data.frame> programming
# 4  2 Entry2 <programming><r><data.frame>           r
# 5  2 Entry2 <programming><r><data.frame>  data.frame
# 6  3 Entry3                <photography> photography

Upvotes: 4

Related Questions