Makx

Reputation: 335

Getting a list of all substrings in a data frame column in R

I have a data frame in R, let's call it data.

One of the columns, data$tags, contains strings. Each string is a comma-separated list of tags (or categories that this entry relates to).

I'm trying to get a list of all available tags in the data frame.

I thought I could use one of the apply functions to run strsplit over the column and get one long concatenated vector of all the string parts, then use unique to get rid of the duplicates.

I tried:

func_split_tags <- function(e) {
  return(unlist(strsplit(e," ")))
}
all_tags <- sapply(as.vector(data$tags), func_split_tags)

but that just gives me a list of the split-string vectors, not a single vector.

Does anyone have any idea how to make this work?

Thanks!

Upvotes: 0

Views: 799

Answers (2)

akrun

Reputation: 887078

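We could do this with str_extract_all from the stringr package, which pulls out every run of word characters (df$s is the string column):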

library(stringr)
unlist(str_extract_all(df$s, "\\w+"))

Upvotes: 0

Gopala

Reputation: 10483

Is something like this what you are looking for?

df <- data.frame(x = 1:10, s = 'I am in the city', stringsAsFactors = FALSE)
as.character(unlist(sapply(df$s, function(x) strsplit(x, ' '))))

If you don't need anything more than a simple strsplit, you could write that last line as:

unlist(strsplit(df$s, ' '))
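Adapted to the comma-separated tags described in the question, the same idea might look like the sketch below. The sample data is made up for illustration, and it assumes the tags may have stray spaces around the commas:

```r
# Hypothetical sample data; only the column name `tags` comes from the question.
data <- data.frame(
  id = 1:3,
  tags = c("r, strings", "strings,data-frame", "r"),
  stringsAsFactors = FALSE
)

# as.character() guards against the column being a factor,
# since strsplit() only accepts character input.
parts <- strsplit(as.character(data$tags), ",", fixed = TRUE)

# Flatten the list into one vector, trim surrounding whitespace,
# then drop the duplicates.
all_tags <- unique(trimws(unlist(parts)))
all_tags
```

strsplit is already vectorised over its input, so no apply function is needed; unlist is what turns the per-row list into one long vector.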

Upvotes: 2
