Reputation: 67
I have a Twitter dataset of 30000 tweets and I'm trying to prepare the data for text analysis. I downloaded the dataset with the academictwitteR package in R. Inside the dataset, some columns (such as "user.metrics", "public.metrics", and "entities") are separate data frames. I managed to extract the columns from "user.metrics" and "public.metrics" and merge them with my original dataset without a problem, as follows:
#extract
extract_publicmetrics <- as.data.frame(mytwitterdata$public_metrics)
colnames(extract_publicmetrics)
[1] "retweet_count" "reply_count" "like_count" "quote_count"
#add observation column to bind with the original data (mytwitterdata)
addconsecutivenumbers1 <- cbind(extract_publicmetrics, "observation"=1:nrow(extract_publicmetrics))
addconsecutivenumbers2 <- cbind(mytwitterdata, "observation"=1:nrow(mytwitterdata))
#merge two data
merged.data <- merge(addconsecutivenumbers1, addconsecutivenumbers2, by="observation")
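As an aside, the helper key and merge() may not be strictly necessary: when the nested data frame is stored as a column, its rows are already aligned with the parent rows, so the columns can be bound directly. A minimal sketch on toy data (`toy` and `flat` are made-up names standing in for my real objects):

```r
# Toy stand-in for mytwitterdata: one ordinary column plus a data-frame column.
toy <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
toy$public_metrics <- data.frame(
  retweet_count = c(1L, 0L, 5L),
  like_count    = c(2L, 3L, 0L)
)

# Drop the nested column from the parent, then bind its columns back in.
# Row i of the nested data frame belongs to row i of the parent, so no key is needed.
flat <- cbind(toy[setdiff(names(toy), "public_metrics")], toy$public_metrics)

print(names(flat))
```

This only works because the nested column is itself a data frame with one row per tweet; it does not apply to list columns such as the hashtags.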
However, I could not extract the "mentions", "urls", and "hashtags" columns from the "entities" data frame in my dataset. I think it's because "mentions", "urls", and "hashtags" are nested lists in that data frame, e.g.:
class(mytwitterdata$entities$hashtags)
[1] "list"
For example, a tweet may contain no hashtag, one hashtag, or more than one hashtag. I want to create a new column from that list in which the value is NA when the tweet has no hashtag, and otherwise the hashtag as text (or the hashtags separated by commas when there is more than one).
Attached is a sample of 10 rows extracted from the "entities" data frame in my dataset:
https://drive.google.com/file/d/1vfyFIObRS9tCxGNJCG9AMyKgxwgwBDMZ/view?usp=sharing
Upvotes: 2
Views: 609
Reputation: 67
I finally solved the problem with the hoist() function from the tidyr package, which plucks selected components out of a list column into their own columns.
Each element of mytwitterdata$entities$hashtags has 3 components ('start', 'end', and 'tag'). If a tweet contains hashtags, they are listed in the tag component. I wanted to pluck out that component and merge it with my original Twitter dataset, which gave me a column with the hashtags. I also converted the NULL rows to NA:
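library(dplyr)
library(tidyr)
# hoist() plucks the 3rd component ('tag') of each element into a new column
hashtag.column <- mytwitterdata %>%
  select(hashtags) %>%
  hoist(hashtags, hashtag = 3)
# tweets without hashtags come through as NULL; recode them as NA
hashtag.column[hashtag.column == "NULL"] <- NA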
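To get one comma-separated string per tweet, as asked in the question, something like the following base R sketch might work. It assumes each element of the hashtags list is either NULL or a data frame with a `tag` column; the toy list and the helper name `collapse_tags` are made up for illustration and are not part of any package:

```r
# Toy stand-in for mytwitterdata$entities$hashtags (assumed shape):
# NULL for tweets without hashtags, otherwise a data frame with a `tag` column.
hashtags <- list(
  NULL,
  data.frame(start = 0L, end = 7L, tag = "rstats", stringsAsFactors = FALSE),
  data.frame(start = c(0L, 9L), end = c(8L, 15L),
             tag = c("dataviz", "tidyr"), stringsAsFactors = FALSE)
)

# Hypothetical helper: NA when there is no hashtag, otherwise the tags joined by ", ".
collapse_tags <- function(x) {
  if (is.null(x) || NROW(x) == 0) return(NA_character_)
  paste(x$tag, collapse = ", ")
}

# vapply() guarantees exactly one character value per tweet.
hashtag_text <- vapply(hashtags, collapse_tags, character(1))
print(hashtag_text)
```

The resulting character vector has one entry per tweet, so it could be assigned straight to a new column of the original data frame.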
Upvotes: 1