Reputation: 3802
I want to collect a user's tweets for data analysis. I am using the HtmlAgilityPack package to scrape Twitter, and it gives me the top 30 tweets.
I located the tweet-text element and fetched all the tweets, but I also want to identify whether each one is a tweet or a retweet. How can I do that?
I have analysed the HTML. A retweet contains an element with the class tweet-context with-icn. But when I select on that class, I get a null reference exception, because not every tweet has that element. What should I select on instead, and how, to find out whether a tweet is a retweet?
Code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://twitter.com/BarackObama");
var TweetsNode = doc.DocumentNode.SelectNodes("//tr[@class='tweet-container']").ToList();
foreach (var item in TweetsNode)
{
    // Print the text content of each tweet row.
    Console.WriteLine(item.InnerText);
}
In the code above, I fetch the tweets from Barack Obama's profile and get the top 30. How can I recognize which ones are retweets?
Thank you.
Upvotes: 0
Views: 528
Reputation: 5150
Get all the tweets from the page (conveniently, each one comes in its own table, <table class='tweet '>):
HtmlWeb p = new HtmlWeb();
var doc = p.Load(@"https://twitter.com/dailygametips");
var nodes = doc.DocumentNode.SelectNodes("//table[@class='tweet ']");
Look inside each node for a <span class='context'>, which indicates that the tweet is a retweet:
List<Tweet> tweets = new List<Tweet>();
foreach (var node in nodes)
{
    bool isRetweet = false;
    var spanNode = node.SelectSingleNode(".//span[@class='context']");
    if (spanNode != null && spanNode.InnerHtml.Contains("retweeted"))
    {
        isRetweet = true;
    }
We also want the message text, so scrape the <div class='tweet-text'> next:
    string msg = string.Empty;
    var msgNode = node.SelectSingleNode(".//div[@class='tweet-text']");
    if (msgNode != null)
    {
        msg = msgNode.InnerText.Trim();
    }
    tweets.Add(new Tweet(msg, isRetweet));
}
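The null exception from the question comes from the fact that SelectSingleNode returns null when nothing matches, and SelectNodes does the same rather than returning an empty list. The following is a minimal, self-contained sketch of the loop above with both cases guarded. The inline HTML is made-up markup for illustration only (it is not Twitter's real page), and the class name NullSafeDemo is hypothetical.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, the same one used above

class NullSafeDemo
{
    static void Main()
    {
        // Hypothetical markup for illustration; not Twitter's real HTML.
        const string html = @"
<table class='tweet '><tr><td>
  <span class='context'>Someone retweeted</span>
  <div class='tweet-text'>First message</div>
</td></tr></table>
<table class='tweet '><tr><td>
  <div class='tweet-text'>Second message</div>
</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//table[@class='tweet ']");
        if (nodes == null)
        {
            Console.WriteLine("No tweets found.");
            return;
        }

        foreach (var node in nodes)
        {
            // SelectSingleNode returns null for tweets without a context span.
            var context = node.SelectSingleNode(".//span[@class='context']");
            bool isRetweet = context != null && context.InnerText.Contains("retweeted");

            var text = node.SelectSingleNode(".//div[@class='tweet-text']");
            string msg = text != null ? text.InnerText.Trim() : string.Empty;

            Console.WriteLine(isRetweet + ": " + msg);
        }
    }
}
```

With this sample input, the first table should be reported as a retweet and the second as a plain tweet, and neither lookup throws when the span is missing.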
Additionally, here is the Tweet container class:
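class Tweet
{
    public Tweet(string message, bool isRetweet)
    {
        Message = message;
        IsRetweet = isRetweet;
    }

    // The properties must be public: a private property with a
    // private setter does not compile and could not be read by callers.
    public string Message { get; private set; }
    public bool IsRetweet { get; private set; }
}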
As you can tell, this is not really rocket science, but you do need to understand the basic principles of XPath and scraping.
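One XPath detail worth knowing here: @class='context' compares the whole attribute string, so it will not match an element that carries several classes, such as the tweet-context with-icn element mentioned in the question. A common XPath 1.0 idiom is to pad the attribute with spaces and look for the class name as a token. The sketch below assumes a one-line sample document, and HasClass is a hypothetical helper name, not part of HtmlAgilityPack.

```csharp
using System;
using HtmlAgilityPack; // NuGet package

class ClassTokenDemo
{
    // Hypothetical helper: builds an XPath 1.0 predicate that matches
    // one class name among several space-separated classes.
    static string HasClass(string name)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample element carrying two classes.
        doc.LoadHtml("<div class='tweet-context with-icn'>retweeted</div>");

        // Exact string comparison: no match, because the attribute holds two classes.
        var exact = doc.DocumentNode.SelectSingleNode("//div[@class='tweet-context']");

        // Token comparison: matches regardless of the other classes present.
        var token = doc.DocumentNode.SelectSingleNode("//div[" + HasClass("tweet-context") + "]");

        Console.WriteLine("exact match found: " + (exact != null));
        Console.WriteLine("token match found: " + (token != null));
    }
}
```

The exact comparison should find nothing here while the token comparison finds the element. This also makes the selector less sensitive to stray whitespace such as the trailing space in 'tweet '.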
Upvotes: 1