Naim Cinar

Reputation: 67

Extracting mentions, hashtags, and URLs and placing them in a new column in a Twitter dataset with R

I have a Twitter dataset of 30,000 tweets that I'm preparing for text analysis. I downloaded it with the academictwitteR package in R. Some columns in the dataset (such as "user.metrics", "public.metrics", and "entities") are separate data frames. I managed to extract the columns from "user.metrics" and "public.metrics" and merge them with my original dataset without a problem, as follows:

#extract
extract_publicmetrics <- as.data.frame(mytwitterdata$public_metrics)
colnames(extract_publicmetrics)
[1] "retweet_count" "reply_count"   "like_count"    "quote_count"

#add observation column to bind with the original data (mytwitterdata)
addconsecutivenumbers1 <- cbind(extract_publicmetrics, "observation"=1:nrow(extract_publicmetrics))
addconsecutivenumbers2 <- cbind(mytwitterdata, "observation"=1:nrow(mytwitterdata))
#merge two data
merged.data <- merge(addconsecutivenumbers1, addconsecutivenumbers2, by="observation")
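Since both objects keep the original row order, the helper "observation" column may not be needed: binding the nested columns directly should give the same layout. Below is a minimal sketch on toy data; the toy values and the name `flat` are made up for illustration, and it assumes public_metrics is a data frame column with one row per tweet.

```r
# Toy stand-in for the real data: public_metrics is itself a data frame
mytwitterdata <- data.frame(id = c("1", "2", "3"))
mytwitterdata$public_metrics <- data.frame(
  retweet_count = c(0L, 2L, 5L),
  like_count    = c(1L, 0L, 9L)
)

# Drop the nested column, then bind its inner columns alongside the rest
flat <- cbind(
  mytwitterdata[setdiff(names(mytwitterdata), "public_metrics")],
  mytwitterdata$public_metrics
)

names(flat)
```

With tidyr loaded, tidyr::unpack(mytwitterdata, public_metrics) is another way to flatten a data frame column.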

However, I could not extract the "mentions", "urls", and "hashtags" columns from the "entities" data frame in my dataset. I think this is because "mentions", "urls", and "hashtags" are nested lists inside that data frame, e.g.:

class(mytwitterdata$entities$hashtags)
[1] "list"

For example, a tweet may contain no hashtag, one hashtag, or several. I want to create a new column from that list in which a row is NA when the tweet has no hashtag and otherwise holds the hashtag as text (or, when there is more than one, the hashtags separated by commas).

Attached is a sample of 10 rows extracted from the "entities" data frame in my dataset:

https://drive.google.com/file/d/1vfyFIObRS9tCxGNJCG9AMyKgxwgwBDMZ/view?usp=sharing

Upvotes: 2

Views: 609

Answers (1)

Naim Cinar

Reputation: 67

I finally solved the problem with the hoist() function from the tidyr package, which plucks selected components out of a list column.

Each non-empty element of mytwitterdata$entities$hashtags has 3 components ('start', 'end', and 'tag'). If a tweet contains hashtags, they are listed in the 'tag' component. I wanted to pluck out that component and merge it into my original Twitter dataset, which let me create a column holding the hashtags. I also converted the NULL rows to NA:

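library(dplyr)
library(tidyr)

# hashtags is a list column inside the nested "entities" data frame
hashtag.column <- mytwitterdata$entities %>%
  select(hashtags) %>%
  hoist(hashtags, hashtag = 3)  # component 3 is 'tag'

# rows without hashtags come back as NULL; convert them to NA
hashtag.column[hashtag.column == "NULL"] <- NA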
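For the comma-separated text format described in the question, a base R sketch along the following lines might also work. It is only an illustration on toy data: the helper `collapse_field` is a hypothetical name, and the field names "username" (mentions) and "expanded_url" (urls) are assumptions about the entities layout that should be checked against the real columns.

```r
# Toy stand-in for mytwitterdata$entities: each list has one element per
# tweet, either NULL (nothing found) or a data frame of matches.
entities <- list(
  hashtags = list(
    NULL,
    data.frame(start = 10L, end = 17L, tag = "rstats"),
    data.frame(start = c(0L, 5L), end = c(4L, 12L), tag = c("nlp", "tweets"))
  ),
  mentions = list(
    data.frame(start = 0L, end = 5L, username = "user"),  # assumed field name
    NULL,
    NULL
  )
)

# Hypothetical helper: collapse one field of every element into a single
# comma-separated string, returning NA when the element has no usable rows.
collapse_field <- function(x, field) {
  vapply(x, function(el) {
    if (!is.data.frame(el) || is.null(el[[field]]) || nrow(el) == 0) {
      return(NA_character_)
    }
    paste(el[[field]], collapse = ", ")
  }, character(1))
}

hashtag_text <- collapse_field(entities$hashtags, "tag")
mention_text <- collapse_field(entities$mentions, "username")

hashtag_text
# [1] NA            "rstats"      "nlp, tweets"
```

On the real data the call would presumably look like mytwitterdata$hashtag_text <- collapse_field(mytwitterdata$entities$hashtags, "tag"), which adds the column directly and keeps the row order, so no merge step is needed.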

Upvotes: 1
