Agustín Indaco

Reputation: 580

Find list of strings that match

I have a dataset of hashtags used in tweets. Each row is a tweet and each variable holds one of the hashtags used in that tweet, so many variables are empty for some observations because those tweets have fewer hashtags. My ultimate objective is to see the co-occurrence of the 3 most popular hashtags, but first I want to see which tweets use these top 3 hashtags.

My dataset looks something like this:

    V1     | V2   | V3  | top3
    nyc    |      |     | nyc, cool, nyc2016
    cool   | nyc  |     | nyc, cool, nyc2016
    hello  | cool | nyc | nyc, cool, nyc2016
    winter | nyc  |     | nyc, cool, nyc2016

So in this example, nyc and cool are among the top 3 hashtags, but hello and winter are not.

I tried seeing if each hashtag was among top3 by doing

    df1<-sapply(df$V1, function(x) grepl(sprintf('\\b%s\\b', x), df$top3))

But it is taking too long. And then I would have to do this for V2 and V3 (could do a loop, but that would take even longer to run).

Any suggestions?

Upvotes: 1

Views: 94

Answers (2)

Wietze314

Reputation: 6020

I would always try to get my data into a normalized (long) format before doing such an operation; I find the data a lot more flexible that way. The solution mentioned in the comment probably works too, but I'd like to share mine:

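    library(dplyr)
    library(tidyr)

    # Example data: one row per tweet, one column per hashtag slot
    df <- data.frame(v1 = c('nyc','cool','hello','winter')
                     ,v2 = c(NA,'nyc','cool','nyc')
                     ,v3 = c(NA,NA,'nyc',NA)
                     ,stringsAsFactors = FALSE)
    top3 <- c('nyc','cool','nyc2016')

    # Reshape to long format (one row per tweet/hashtag pair),
    # drop the empty slots, then count top-3 matches per tweet
    df %>% mutate(id = row_number()) %>% gather(n, word,-id) %>% 
      filter(!is.na(word)) %>% group_by(id) %>%
      summarise(n_in_top3 = sum(ifelse(word %in% top3,1,0)))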

results in:

    id        n_in_top3
    (int)     (dbl)
    1         1
    2         2
    3         2
    4         1

The result is a summary that counts, for each row in your data, how many of its words are in the top 3 list.

If you want a TRUE/FALSE value for each of the columns instead, do the following:

    df %>% mutate(id = row_number()) %>% gather(n, word,-id) %>% 
      filter(!is.na(word)) %>% group_by(id, n) %>%
      summarise(n_in_top3 = (word %in% top3)) %>%
      spread(n, n_in_top3)

which gives:

    id    v1      v2     v3
    <int> <lgl>   <lgl>  <lgl>
    1     TRUE    NA     NA
    2     TRUE    TRUE   NA
    3     FALSE   TRUE   TRUE
    4     FALSE   TRUE   NA
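
For readers on a recent tidyr, here is a minimal sketch of the same per-tweet count written with `pivot_longer()`, which supersedes `gather()`. It assumes tidyr 1.0.0 or later; the name `counts` is only illustrative, and the data is the same toy example as above.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the answer above
df <- data.frame(v1 = c("nyc", "cool", "hello", "winter"),
                 v2 = c(NA, "nyc", "cool", "nyc"),
                 v3 = c(NA, NA, "nyc", NA),
                 stringsAsFactors = FALSE)
top3 <- c("nyc", "cool", "nyc2016")

counts <- df %>%
  mutate(id = row_number()) %>%
  # One row per (tweet, hashtag slot); empty (NA) slots are dropped
  pivot_longer(-id, names_to = "n", values_to = "word",
               values_drop_na = TRUE) %>%
  group_by(id) %>%
  # Summing a logical vector counts the TRUE values
  summarise(n_in_top3 = sum(word %in% top3))
```

Because `sum()` is applied to a logical vector here, `n_in_top3` comes back as an integer column rather than a double.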

Upvotes: 1

Aurèle

Reputation: 12839

Can we safely assume top3 is the same in every row of your dataset? If so:

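    # Rebuild the example data from the question
    df <- read.table(
      textConnection("    V1 |  V2  |  V3 |      top3
        nyc|      |     | nyc, cool, nyc2016
       cool| nyc  |     | nyc, cool, nyc2016
      hello| cool | nyc | nyc, cool, nyc2016
     winter| nyc  |     | nyc, cool, nyc2016"),
      sep = "|", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
    library(dplyr) ; library(stringr)
    # Split the top3 string of the first row into a character vector
    top <- str_split(df$top3[[1]], pattern = ", ")[[1]]
    # Vectorised membership test, applied to each hashtag column
    is_in_top <- function(x) x %in% top
    # Note: mutate_each() and funs() are deprecated in current dplyr
    mutate_each(df, funs(is_in_top), vars = V1:V3)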
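
Since `mutate_each()` and `funs()` are deprecated in current dplyr, the same idea can be sketched with `across()`. This is only an illustration, assuming dplyr 1.0.0 or later and, as above, that `top3` is identical in every row; the name `flags` is hypothetical, and base `strsplit()` stands in for `str_split()` to keep the example to one package.

```r
library(dplyr)

# Toy data mirroring the question; empty hashtag slots are empty strings
df <- data.frame(V1 = c("nyc", "cool", "hello", "winter"),
                 V2 = c("", "nyc", "cool", "nyc"),
                 V3 = c("", "", "nyc", ""),
                 top3 = "nyc, cool, nyc2016",
                 stringsAsFactors = FALSE)

# Split the shared top3 string once (assumes it is the same in every row)
top <- strsplit(df$top3[[1]], ", ", fixed = TRUE)[[1]]

# Replace each hashtag column with TRUE/FALSE membership in `top`
flags <- mutate(df, across(V1:V3, ~ .x %in% top))
```

`%in%` does exact matching on the whole vector at once, so there is no per-row `grepl()` call as in the question, and `nyc` cannot accidentally match `nyc2016`.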

Upvotes: 3
