Reputation: 580
I have a dataset of hashtags used in tweets. Each row is a particular tweet and each variable is a different hashtag used in that tweet, so many variables are empty for some observations because those tweets have fewer hashtags. My ultimate objective is to see the co-occurrence of the 3 most popular hashtags, but first I want to see which tweets use these top 3 hashtags.
My dataset looks something like this:
V1 | V2 | V3 | top3
nyc| | | nyc, cool, nyc2016
cool| nyc | | nyc, cool, nyc2016
hello| cool | nyc | nyc, cool, nyc2016
winter| nyc | | nyc, cool, nyc2016
So in this example, the only top 3 hashtags that actually appear in the tweets are nyc and cool; hello and winter are not in the top 3.
I tried seeing if each hashtag was among top3 by doing
df1<-sapply(df$V1, function(x) grepl(sprintf('\\b%s\\b', x), df$top3))
But it is taking too long. And then I would have to do this for V2 and V3 (could do a loop, but that would take even longer to run).
Any suggestions?
Upvotes: 1
Views: 94
Reputation: 6020
I would always try to get my data into a normalized (long) format before doing such an operation; I find the data a lot more flexible that way. Although the solution mentioned in the comment probably works too, I'd like to share my solution:
library(dplyr)
library(tidyr)
df <- data.frame(v1 = c('nyc','cool','hello','winter')
                 ,v2 = c(NA,'nyc','cool','nyc')
                 ,v3 = c(NA,NA,'nyc',NA)
                 ,stringsAsFactors = FALSE)
top3 <- c('nyc','cool','nyc2016')
# reshape to one row per (tweet id, hashtag), drop empty slots,
# then count per tweet how many hashtags are in top3
df %>% mutate(id = row_number()) %>% gather(n, word,-id) %>%
  filter(!is.na(word)) %>% group_by(id) %>%
  summarise(n_in_top3 = sum(ifelse(word %in% top3,1,0)))
results in:
id n_in_top3
(int) (dbl)
1 1
2 2
3 2
4 1
The result is a summary with a count of how many words were in the top 3 word list, for each row in your data.
If you want it to have a TRUE/FALSE value for each of the columns, do the following:
df %>% mutate(id = row_number()) %>% gather(n, word,-id) %>%
filter(!is.na(word)) %>% group_by(id, n) %>%
summarise(n_in_top3 = (word %in% top3)) %>%
spread(n, n_in_top3)
which gives:
id v1 v2 v3
<int> <lgl> <lgl> <lgl>
1 TRUE NA NA
2 TRUE TRUE NA
3 FALSE TRUE TRUE
4 FALSE TRUE NA
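For readers on a recent tidyr, where gather() and spread() are superseded by pivot_longer() and pivot_wider(), the same reshape can be sketched roughly as below. This is only an illustration of the approach above, not a replacement for it; the names flags and in_top3 are made up for the example, and it assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# same toy data as above
df <- data.frame(v1 = c("nyc", "cool", "hello", "winter"),
                 v2 = c(NA, "nyc", "cool", "nyc"),
                 v3 = c(NA, NA, "nyc", NA),
                 stringsAsFactors = FALSE)
top3 <- c("nyc", "cool", "nyc2016")

flags <- df %>%
  mutate(id = row_number()) %>%
  # long format: one row per (tweet id, column name, hashtag); empty slots dropped
  pivot_longer(-id, names_to = "n", values_to = "word",
               values_drop_na = TRUE) %>%
  # hypothetical helper column: is this hashtag one of the top 3?
  mutate(in_top3 = word %in% top3) %>%
  select(-word) %>%
  # back to wide: one logical column per original column, NA where no hashtag
  pivot_wider(names_from = n, values_from = in_top3)

flags
```

As in the spread() version, slots that held no hashtag come back as NA rather than FALSE.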
Upvotes: 1
Reputation: 12839
Can we safely assume top3 is the same for every row in your dataset? If so:
df <- read.table(
textConnection(" V1 | V2 | V3 | top3
nyc| | | nyc, cool, nyc2016
cool| nyc | | nyc, cool, nyc2016
hello| cool | nyc | nyc, cool, nyc2016
winter| nyc | | nyc, cool, nyc2016"),
sep = "|", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
library(dplyr) ; library(stringr)
top <- str_split(df$top3[[1]], pattern = ", ")[[1]]
is_in_top <- function(x) x %in% top
mutate_each(df, funs(is_in_top), vars = V1:V3)
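Note that mutate_each() and funs() have been deprecated in later dplyr releases. A rough equivalent with across(), assuming dplyr 1.0.0 or later, might look like the sketch below. The data frame is rebuilt by hand here so the snippet runs on its own, with empty strings standing in for the blank cells; res is just an illustrative name.

```r
library(dplyr)

# hand-built copy of the example data; "" marks an empty hashtag slot
df <- data.frame(V1 = c("nyc", "cool", "hello", "winter"),
                 V2 = c("", "nyc", "cool", "nyc"),
                 V3 = c("", "", "nyc", ""),
                 top3 = "nyc, cool, nyc2016",
                 stringsAsFactors = FALSE)

# split the first top3 string once (assumes top3 is identical on every row)
top <- strsplit(df$top3[[1]], ", ", fixed = TRUE)[[1]]

# replace each hashtag column with TRUE/FALSE: is the value in `top`?
res <- mutate(df, across(V1:V3, ~ .x %in% top))
res
```

Because %in% is vectorised over whole columns, this avoids building one regular expression per row, which is likely what made the sapply()/grepl() attempt slow. Empty slots come out as FALSE here rather than NA.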
Upvotes: 3