Reputation: 43
I want to extract the number of followers from https://www.instagram.com/bbcpersian/. I tried the following code, but it does not work properly.
var url = "https://www.instagram.com/bbcpersian/";
var web = new HtmlWeb();
var htmlDoc = web.Load(url);
var node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span");
string result = node.WriteContentTo();
Console.WriteLine(result);
OR
var html = @"https://www.instagram.com/bbcpersian/";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerHtml + " - " + node.Attributes["title"].Value);
}
Upvotes: 1
Views: 132
Reputation: 954
Instagram pages are complicated. Your XPath "/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span"
doesn't work because that part of the DOM doesn't exist in the HTML you download; in a web browser, most of the DOM of an Instagram page is built by JavaScript after the page loads.
Note, though, that you do have this in the downloaded web page:
<meta content="6.3m Followers, 11 Following, 17.5k Posts - See Instagram photos and videos from BBC NEWS فارسی (@bbcpersian)" name="description" />
It's pretty easy to scrape this raw HTML with a regular expression:
Match m = Regex.Match(rawHTML, "\"(?<followers>.+?) Followers, (?<following>.+?) Following, (?<posts>.+?) Posts");
string result = m.Groups["followers"].Value;
Here is what your code would look like rewritten using this technique:
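// Requires: using HtmlAgilityPack; using System.Text.RegularExpressions;
var url = "https://www.instagram.com/bbcpersian/";
var web = new HtmlWeb();
var htmlDoc = web.Load(url);
// Work on the raw downloaded HTML rather than the parsed DOM.
string rawHTML = htmlDoc.Text;
Match m = Regex.Match(rawHTML, "\"(?<followers>.+?) Followers, (?<following>.+?) Following, (?<posts>.+?) Posts");
// The groups are empty when the description is missing, so check Success first.
string result = m.Success ? m.Groups["followers"].Value : null;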
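The captured value is abbreviated text such as "6.3m", not a number. If you need a number, here is a minimal sketch of one way to convert it. The class and method names (FollowerCount, TryExtractFollowers, ParseAbbreviated) are made up for illustration, and it assumes the only suffixes are k, m and b. It also tightens the capture group slightly so it cannot run across earlier quotes on the same line.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FollowerCount
{
    // Finds the abbreviated follower text (e.g. "6.3m") in raw HTML.
    // Returns false when the "N Followers, " pattern is not present.
    public static bool TryExtractFollowers(string rawHtml, out string followers)
    {
        Match m = Regex.Match(rawHtml, "\"(?<followers>[^\" ]+) Followers, ");
        followers = m.Success ? m.Groups["followers"].Value : string.Empty;
        return m.Success;
    }

    // Converts text such as "6.3m", "17.5k" or "1,234" to an approximate count.
    // Assumption: only the suffixes k, m and b occur.
    public static long ParseAbbreviated(string text)
    {
        string s = text.Trim().Replace(",", "").ToLowerInvariant();
        long multiplier = 1;
        if (s.EndsWith("k", StringComparison.Ordinal)) multiplier = 1_000;
        else if (s.EndsWith("m", StringComparison.Ordinal)) multiplier = 1_000_000;
        else if (s.EndsWith("b", StringComparison.Ordinal)) multiplier = 1_000_000_000;

        // Drop the suffix character before parsing the numeric part.
        if (multiplier != 1) s = s.Substring(0, s.Length - 1);

        decimal value = decimal.Parse(s, NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);
        return (long)(value * multiplier);
    }
}
```

Because the page only exposes the rounded text, the result is approximate: "6.3m" becomes 6300000, not the exact follower count.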
Upvotes: 0
Reputation: 637
Did you check the HTML structure in view source?
Your actual HTML at /html/body/div[1] is shown below. The content you see on the page is loaded dynamically, so those structures are not available in the HTML document you are creating. You need to consider another option to do this.
<div id="react-root">
<span><svg width="50" height="50" viewBox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7"><path d="M25 1c-6.52 0-7.34.03-9.9.14-2.55.12-4.3.53-5.82 1.12a11.76 11.76 0 0 0-4.25 2.77 11.76 11.76 0 0 0-2.77 4.25c-.6 1.52-1 3.27-1.12 5.82C1.03 17.66 1 18.48 1 25c0 6.5.03 7.33.14 9.88.12 2.56.53 4.3 1.12 5.83a11.76 11.76 0 0 0 2.77 4.25 11.76 11.76 0 0 0 4.25 2.77c1.52.59 3.27 1 5.82 1.11 2.56.12 3.38.14 9.9.14 6.5 0 7.33-.02 9.88-.14 2.56-.12 4.3-.52 5.83-1.11a11.76 11.76 0 0 0 4.25-2.77 11.76 11.76 0 0 0 2.77-4.25c.59-1.53 1-3.27 1.11-5.83.12-2.55.14-3.37.14-9.89 0-6.51-.02-7.33-.14-9.89-.12-2.55-.52-4.3-1.11-5.82a11.76 11.76 0 0 0-2.77-4.25 11.76 11.76 0 0 0-4.25-2.77c-1.53-.6-3.27-1-5.83-1.12A170.2 170.2 0 0 0 25 1zm0 4.32c6.4 0 7.16.03 9.69.14 2.34.11 3.6.5 4.45.83 1.12.43 1.92.95 2.76 1.8a7.43 7.43 0 0 1 1.8 2.75c.32.85.72 2.12.82 4.46.12 2.53.14 3.29.14 9.7 0 6.4-.02 7.16-.14 9.69-.1 2.34-.5 3.6-.82 4.45a7.43 7.43 0 0 1-1.8 2.76 7.43 7.43 0 0 1-2.76 1.8c-.84.32-2.11.72-4.45.82-2.53.12-3.3.14-9.7.14-6.4 0-7.16-.02-9.7-.14-2.33-.1-3.6-.5-4.45-.82a7.43 7.43 0 0 1-2.76-1.8 7.43 7.43 0 0 1-1.8-2.76c-.32-.84-.71-2.11-.82-4.45a166.5 166.5 0 0 1-.14-9.7c0-6.4.03-7.16.14-9.7.11-2.33.5-3.6.83-4.45a7.43 7.43 0 0 1 1.8-2.76 7.43 7.43 0 0 1 2.75-1.8c.85-.32 2.12-.71 4.46-.82 2.53-.11 3.29-.14 9.7-.14zm0 7.35a12.32 12.32 0 1 0 0 24.64 12.32 12.32 0 0 0 0-24.64zM25 33a8 8 0 1 1 0-16 8 8 0 0 1 0 16zm15.68-20.8a2.88 2.88 0 1 0-5.76 0 2.88 2.88 0 0 0 5.76 0z"/></svg></span>
</div>
Upvotes: 1
Reputation: 481
I used Selenium to crawl a site and extract images, as shown below. It may be useful for you:
IWebDriver _webDriver = null;
var firefoxOptions = new FirefoxOptions
{
LogLevel = FirefoxDriverLogLevel.Debug,
BrowserExecutableLocation = Configuration.Developer.SeleniumBrowserExecutableLocation
};
firefoxOptions.AddArguments("no-sandbox");
firefoxOptions.AddArguments("-headless");
_webDriver = new RemoteWebDriver(new Uri($"{Configuration.Developer.SeleniumRemoteUrl}"), firefoxOptions);
_webDriver.Manage().Window.Maximize();
_webDriver.Manage().Cookies.DeleteAllCookies();
_webDriver.Url = $"https://www.YourSite.com/";
_webDriver.Navigate();
var wait = new WebDriverWait(_webDriver, new TimeSpan(0, 0, 30));
var element = wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementIsVisible(By.ClassName("jumbo-hero")));
var imageContent = element.GetAttribute("innerHTML");
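_webDriver.Quit();
// "doc" was not defined in the original snippet; presumably it is an HtmlAgilityPack document built from the rendered HTML.
var doc = new HtmlDocument();
doc.LoadHtml(imageContent);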
var fromSrc = doc.DocumentNode.Descendants("img").Where(e => e.Attributes.Contains("src") && string.IsNullOrWhiteSpace(e.Attributes["src"].Value) == false).Select(e => e.Attributes["src"].Value).ToList();
var fromDataSrc = doc.DocumentNode.Descendants("img").Where(e => e.Attributes.Contains("data-src") && string.IsNullOrWhiteSpace(e.Attributes["data-src"].Value) == false).Select(e => e.Attributes["data-src"].Value).ToList();
Upvotes: 0
Reputation: 1
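You can use a regular expression that looks for the span where the follower count is located. Note that this span only exists in the rendered HTML, and the displayed text should not be hard-coded:
/<a class="-nal3 " href="\/[a-zA-Z0-9._]+\/followers\/"><span class="g47SY " title="([0-9.,]+)">[^<]+<\/span>/m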
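In C#, the same idea might look like the sketch below. The class names "-nal3 " and "g47SY " are taken from the pattern above and are likely to change, the sample markup in the usage note is hypothetical, and RenderedFollowers / TryGetExactCount are names made up for illustration. It only works on HTML that has already been rendered (for example, obtained through a browser driver), since the span is not in the static download.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RenderedFollowers
{
    // Captures the title attribute (the full count) and the displayed text.
    // Assumption: the markup matches the pattern quoted in the answer above.
    private static readonly Regex FollowersSpan = new Regex(
        "<a class=\"-nal3 \" href=\"/[A-Za-z0-9._]+/followers/\">" +
        "<span class=\"g47SY \" title=\"(?<exact>[0-9.,]+)\">(?<shown>[^<]+)</span>");

    // Returns false when the span is not found in the given HTML.
    public static bool TryGetExactCount(string renderedHtml, out string exact)
    {
        Match m = FollowersSpan.Match(renderedHtml);
        exact = m.Success ? m.Groups["exact"].Value : string.Empty;
        return m.Success;
    }
}
```

For example, given the hypothetical fragment `<a class="-nal3 " href="/bbcpersian/followers/"><span class="g47SY " title="6,300,000">6.3m</span>`, TryGetExactCount returns true and sets exact to "6,300,000".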
Upvotes: 0