Find list of strings that match

Question

I have a dataset of hashtags used in tweets. Each row is a tweet and each variable holds one of the hashtags used in that tweet, so many variables are empty for some observations because those tweets have fewer hashtags. My ultimate objective is to see the co-occurrence of the 3 most popular hashtags, but first I want to see which tweets use these top 3 hashtags.

My dataset looks something like this:

    V1     | V2   | V3  | top3
    nyc    |      |     | nyc, cool, nyc2016
    cool   | nyc  |     | nyc, cool, nyc2016
    hello  | cool | nyc | nyc, cool, nyc2016
    winter | nyc  |     | nyc, cool, nyc2016

So in this example, nyc and cool are among the top 3 hashtags, but hello and winter are not.

I tried seeing if each hashtag was among top3 by doing

    df1 <- sapply(df$V1, function(x) grepl(sprintf('\\b%s\\b', x), df$top3))

But it is taking too long. And then I would have to do this for V2 and V3 (could do a loop, but that would take even longer to run).

Any suggestions?

Aurèle · Accepted Answer

Can we safely assume top3 is the same in every row of your dataset? If so:

    library(dplyr)
    library(stringr)

    # Rebuild the example data from the question.
    df <- read.table(
      textConnection("    V1 |  V2  |  V3 |      top3
        nyc|      |     | nyc, cool, nyc2016
       cool| nyc  |     | nyc, cool, nyc2016
      hello| cool | nyc | nyc, cool, nyc2016
     winter| nyc  |     | nyc, cool, nyc2016"),
      sep = "|", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

    # top3 is the same in every row, so split it only once.
    top <- str_split(df$top3[[1]], pattern = ", ")[[1]]
    is_in_top <- function(x) x %in% top

    # across() replaces the deprecated mutate_each()/funs() pair.
    mutate(df, across(V1:V3, is_in_top))
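For comparison, here is a minimal base R sketch of the same idea, with no packages needed. It rebuilds the example data by hand, and the names `tag_cols`, `in_top`, `uses_top` and `in_top_by_row` are made up for illustration. The second half assumes top3 might differ from row to row, which the question does not state; if it is constant, the first half is enough.

```r
# Example data from the question, typed in by hand.
df <- data.frame(
  V1   = c("nyc", "cool", "hello", "winter"),
  V2   = c("", "nyc", "cool", "nyc"),
  V3   = c("", "", "nyc", ""),
  top3 = "nyc, cool, nyc2016",
  stringsAsFactors = FALSE
)

tag_cols <- c("V1", "V2", "V3")

# Case 1: top3 is identical in every row, so split it once and let the
# vectorised %in% test a whole column at a time (no regex involved).
top <- strsplit(df$top3[1], ", ", fixed = TRUE)[[1]]
in_top <- sapply(df[tag_cols], function(col) col %in% top)

# Tweets that use at least one of the top hashtags.
uses_top <- rowSums(in_top) > 0

# Case 2 (assumption): top3 may vary by row, so split it per row and
# compare each hashtag with its own row's list.
top_by_row <- strsplit(df$top3, ", ", fixed = TRUE)
in_top_by_row <- sapply(df[tag_cols], function(col) {
  mapply(function(tag, top_i) tag %in% top_i, col, top_by_row,
         USE.NAMES = FALSE)
})
```

Both `in_top` and `in_top_by_row` are logical matrices with one row per tweet and one column per hashtag variable. Case 1 should be much faster on a large dataset because it does one exact-match lookup per column instead of one regex search per row.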
