I have a data frame of tweets with the creation date, tweet id, favorite count and retweet count. I want to create a corpus in which each document carries its favorite and retweet counts as variables. I also want the documents to be identified by tweet id rather than by generated ids such as doc 0001.
I start with the data below; the rest of the code follows it.
id
1: 737243856144629760
2: 737242308261842945
3: 737242189055594496
4: 737242018687164416
5: 737241411465170944
6: 737239685295181824
text
1: Have a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN!
2: "@NBCDFW: Trump rallies veterans at annual Rolling Thunder Gathering https://twitter.com/b08FcMlgkr https://twitter.com/RCDeLvHQqD"
3: "@FrankyLamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4: "@MariaErnandez3b: Trump Supports Rolling Thunder Rally #TRUMP STRONG https://twitter.com/pfVXQ8NdZu" So true, and remember the M.I.A.'s!
5: "@ScottWRasmussen: Donald Trump and Bikers Share Affection at Rolling Thunder Rally https://twitter.com/ZZl2sc29dn" A great day in D.C.!
6: "@TeaPartyNevada: #Trump2016 "Illegals are taken care of better than our veterans." https://twitter.com/KKIgM4rNma https://twitter.com/1cEZ8wG7Cy"
favorited favoriteCount replyToSN created truncated replyToSID replyToUID
1: FALSE 25944 NA 2016-05-30 11:26:47 FALSE NA NA
2: FALSE 9268 NA 2016-05-30 11:20:38 FALSE NA NA
3: FALSE 6739 NA 2016-05-30 11:20:09 FALSE NA NA
4: FALSE 15417 NA 2016-05-30 11:19:29 FALSE NA NA
5: FALSE 7192 NA 2016-05-30 11:17:04 FALSE NA NA
6: FALSE 9834 NA 2016-05-30 11:10:12 FALSE NA NA
statusSource screenName retweetCount
1: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 9455
2: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 2744
3: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 1604
4: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 4237
5: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 2148
6: <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> realDonaldTrump 3545
isRetweet retweeted longitude latitude
1: FALSE FALSE NA NA
2: FALSE FALSE NA NA
3: FALSE FALSE NA NA
4: FALSE FALSE NA NA
5: FALSE FALSE NA NA
6: FALSE FALSE NA NA
cleantxt
1: have a great memorial day and remember that we will soon make america great again!
2: "@nbcdfw: trump rallies veterans at annual rolling thunder gathering https://twitter.com/b08fcmlgkr https://twitter.com/rcdelvhqqd"
3: "@frankylamouche: how many of donald's rolling thunder brigade will sign up and go to war for him in the middle east."
4: "@mariaernandez3b: trump supports rolling thunder rally #trump strong https://twitter.com/pfvxq8ndzu" so true, and remember the m.i.a.'s!
5: "@scottwrasmussen: donald trump and bikers share affection at rolling thunder rally https://twitter.com/zzl2sc29dn" a great day in d.c.!
6: "@teapartynevada: #trump2016 "illegals are taken care of better than our veterans." https://twitter.com/kkigm4rnma https://twitter.com/1cez8wg7cy"
I try to convert it to a corpus with
myReader <- readTabular(mapping=list(content="cleantxt", id="id", created="created", retweet="retweetCount", fav="favoriteCount"))
trumptweetsenhanced <- VCorpus(DataframeSource(trumptweets.df), readerControl=list(reader=myReader))
However, when I convert the corpus back to a data frame, the added variables are missing:
> head(trumptweetsenhanced_dataframe.df)
docs text
1 doc 0001 great memori day rememb will soon make america great
2 doc 0002 nbcdfw trump ralli veteran annual roll thunder gather
3 doc 0003 frankylamouch mani donald roll thunder brigad will sign go war middl east
4 doc 0004 mariaernandezb trump support roll thunder ralli trump strong true rememb ms
5 doc 0005 scottwrasmussen donald trump biker share affect roll thunder ralli great day dc
6 doc 0006 teapartynevada trump illeg taken care better veteran
Upvotes: 0
Views: 372
You can add metadata to your tweet corpus with the tm::meta() function. See library(tm); example(meta).
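As a minimal sketch of what meta() does on a single document: the two texts and the "retweet" tag below are made up for illustration, and the count is borrowed from the first row of the question's data.

```r
library(tm)

# Two toy documents standing in for tweets (illustrative text only).
corp <- VCorpus(VectorSource(c("first toy tweet", "second toy tweet")))

# Each document carries its own metadata; this lists the default tags.
meta(corp[[1]])

# Write a custom tag on the first document, then read it back.
meta(corp[[1]], tag = "retweet") <- 9455
meta(corp[[1]], tag = "retweet")
```

The replacement form meta(...) <- value modifies the document inside the corpus, so no separate reassignment is needed.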
This metadata annotation can happen at the corpus level: you might want to store "common" metadata such as the date when the tweets in this corpus were harvested, the search query string, or API call details.
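Corpus-level tags could be set roughly like this. The tag names "harvested" and "query" and their values are hypothetical placeholders, not taken from the question.

```r
library(tm)

corp <- VCorpus(VectorSource(c("first toy tweet", "second toy tweet")))

# type = "corpus" stores one value for the whole corpus.
# Both tags and values below are invented for illustration.
meta(corp, tag = "harvested", type = "corpus") <- "2016-05-30"
meta(corp, tag = "query", type = "corpus") <- "from:realDonaldTrump"

# Show all corpus-level tag-value pairs.
meta(corp, type = "corpus")
```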
Annotation can also happen at the document level (here, per tweet): you can store tweet attributes from your trumptweets.df data frame, such as retweet count, favorite count or language, inside the corpus.
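A sketch of per-tweet annotation, assuming a small stand-in for trumptweets.df (two rows copied from the question, with shortened text). The ids are kept as character strings because 18-digit tweet ids cannot be represented exactly as doubles.

```r
library(tm)

# Stand-in for the first two rows of trumptweets.df (text shortened).
tweets <- data.frame(
  id            = c("737243856144629760", "737242308261842945"),
  cleantxt      = c("have a great memorial day", "trump rallies veterans"),
  retweetCount  = c(9455, 2744),
  favoriteCount = c(25944, 9268),
  stringsAsFactors = FALSE
)

corp <- VCorpus(VectorSource(tweets$cleantxt))

# Replace the generated document ids with the tweet ids
# (type = "local" writes into each document's own metadata).
meta(corp, tag = "id", type = "local") <- tweets$id

# Attach the counts as indexed metadata: one data-frame row per document,
# stored alongside the corpus.
meta(corp, tag = "retweet") <- tweets$retweetCount
meta(corp, tag = "fav") <- tweets$favoriteCount

meta(corp)              # data frame with retweet and fav columns
meta(corp[[1]], "id")   # tweet id of the first document
```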
This takes careful housekeeping. You typically write a set of small custom functions and use them with the *apply family (or purrr::walk*, or purrr::map*) to call meta() for both reading and writing.
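One possible shape for that housekeeping is sketched below. The helpers annotate_doc() and get_tag() are hypothetical names written for this example, not part of tm, and the data frame is the same two-row stand-in for trumptweets.df.

```r
library(tm)

tweets <- data.frame(
  id            = c("737243856144629760", "737242308261842945"),
  cleantxt      = c("have a great memorial day", "trump rallies veterans"),
  retweetCount  = c(9455, 2744),
  favoriteCount = c(25944, 9268),
  stringsAsFactors = FALSE
)
corp <- VCorpus(VectorSource(tweets$cleantxt))
n <- nrow(tweets)  # one document per data-frame row

# Hypothetical writer: copy one row of attributes into one document.
annotate_doc <- function(doc, row) {
  meta(doc, "id")      <- row$id
  meta(doc, "retweet") <- row$retweetCount
  meta(doc, "fav")     <- row$favoriteCount
  doc
}
for (i in seq_len(n)) {
  corp[[i]] <- annotate_doc(corp[[i]], tweets[i, ])
}

# Hypothetical reader: pull one tag out of every document.
get_tag <- function(tag, template) {
  vapply(seq_len(n), function(i) meta(corp[[i]], tag), template)
}

# Rebuild a data frame that keeps the ids and counts next to the text.
restored <- data.frame(
  id      = get_tag("id", character(1)),
  retweet = get_tag("retweet", numeric(1)),
  fav     = get_tag("fav", numeric(1)),
  text    = vapply(seq_len(n),
                   function(i) paste(content(corp[[i]]), collapse = " "),
                   character(1)),
  stringsAsFactors = FALSE
)
restored
```

Keeping the values inside each document means they travel with the document when the corpus is subset or reordered.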
I'm writing this off the top of my head. It's been a while since I worked with meta(). Maybe there is a completely different way (use nested data frames, or use other text-mining packages).
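One such different route, assuming tm 0.7 or later: as far as I recall, DataframeSource() in those versions expects the first two columns to be named doc_id and text, and keeps any remaining columns as document-level metadata, so no custom reader is needed. The data frame is again a two-row stand-in with shortened text.

```r
library(tm)

# Assumes tm >= 0.7: columns must be named doc_id and text, in that order.
tweets <- data.frame(
  doc_id  = c("737243856144629760", "737242308261842945"),
  text    = c("have a great memorial day", "trump rallies veterans"),
  retweet = c(9455, 2744),
  fav     = c(25944, 9268),
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(tweets))

meta(corp)              # the extra columns: retweet and fav
meta(corp[[1]], "id")   # the tweet id taken from doc_id
```

With the real data this would mean renaming id to doc_id and cleantxt to text before building the corpus.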
Upvotes: 1