Jacob Eriksson

Reputation: 1

C# Webscraper to grab amount of Google Results given a specific search term

I've been working on a web scraper as a Windows Forms application in C#. The user enters a search term, and the program splits the search string into individual words and looks up the number of search results for each word through Yahoo and Google.

My issue lies with navigating the huge HTML document. I've tried multiple approaches, such as iterating recursively and comparing ids, as well as using lambdas with Where. Both result in null. I also manually inspected the HTML document to make sure the id of the div I want exists in it.

The id I'm looking for is "resultStats", but it is very deeply nested. My code looks like this:

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace WebScraper2._0
{
    public class Webscraper
    {
        private string Google = "http://google.com/#q=";
        private string Yahoo = "http://search.yahoo.com/search?p=";
        private HtmlWeb web = new HtmlWeb();

        private HtmlDocument GoogleDoc = new HtmlDocument();
        private HtmlDocument YahooDoc = new HtmlDocument();

        public Webscraper()
        {
            Console.WriteLine("Init");
        }

        public int WebScrape(string searchterms)
        {
            //Console.WriteLine(searchterms);
            string[] ssize = searchterms.Split(new char[0]);

            int YahooMatches = 0;
            int GoogleMatches = 0;

            foreach (var term in ssize)
            {
                //Console.WriteLine(term);
                var y = web.Load(Yahoo + term);
                var g = web.Load(Google + term + "&cad=h");
                YahooMatches += YahooFilter(y);
                GoogleMatches += GoogleFilter(g);

            }

            Console.WriteLine("Yahoo found " + YahooMatches.ToString() + " matches");
            Console.WriteLine("Google found " + GoogleMatches.ToString() + " matches");

            return YahooMatches + GoogleMatches;
        }

        //Parse to get correct info
        public int YahooFilter(HtmlDocument doc)
        {
            //Look for node with correct ID
            IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"));

            foreach (var item in nodes)
            {
                // displaying final output
                Console.WriteLine(item.InnerText);
            }
            //TODO: Return search resultamount.
            return 0;
        }




        int testCounter = 0;
        string toReturn = "";
        bool foundMatch = false;

        //Parse to get correct info
        public int GoogleFilter(HtmlDocument doc)
        {
            if (doc == null)
            {
                Console.WriteLine("Null");
            }

            foreach (var node in doc.DocumentNode.ChildNodes)
            {
                toReturn += Looper(node, testCounter, toReturn, foundMatch);
            }

            Console.WriteLine(toReturn);

            /*
            var stuff = doc.DocumentNode.Descendants("div")
              .Where(node => node.GetAttributeValue("id", "")
              .Equals("extabar")).ToList();

            IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("appbar"));
            */

            return 0;
        }


        public string Looper(HtmlNode node, int counter, string returnstring, bool foundMatch)
        {
            Console.WriteLine("Loop started" + counter.ToString());
            counter++;
            Console.WriteLine(node.Id);

            if (node.Id == "resultStats")
            {
                returnstring += node.InnerText;

            }

            foreach (HtmlNode n in node.Descendants())
            {
                Looper(n, counter, returnstring, foundMatch);
            }

            return returnstring;

        }
    }
}


Upvotes: 0

Views: 224

Answers (1)

Uriel Andreazza

Reputation: 56

I made a Google HTML scraper a few weeks ago. A few things to consider:

First: Google doesn't like it when you scrape their search HTML. While I was running through a list of companies trying to get their addresses and phone numbers, Google blocked my IP from accessing their website for a little while (which caused a hilarious panic in the office).

Second: Google changes the HTML of the page (id names, etc.), so relying on ids won't work for long. In my case I used a combination of HTML tags and specific text to parse the response and extract the information I wanted.

Third: It's better to just use their API to grab the information you need. Just make sure you respect the free tier's query limit and you should be golden.
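As a rough sketch of the third point, assuming the API in question is Google's Custom Search JSON API (the answer does not name it): a response from that API carries the result count as a string under `searchInformation.totalResults`. The key, engine id, class name, and sample JSON below are placeholders for illustration, and no request is actually sent.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

public static class TotalResultsDemo
{
    // Builds a request URL for the Custom Search JSON API (assumed here).
    // apiKey and engineId are placeholders you would get from Google.
    public static string BuildUrl(string apiKey, string engineId, string term)
    {
        return "https://www.googleapis.com/customsearch/v1"
            + "?key=" + Uri.EscapeDataString(apiKey)
            + "&cx=" + Uri.EscapeDataString(engineId)
            + "&q=" + Uri.EscapeDataString(term);
    }

    // Reads searchInformation.totalResults, which is a string in the JSON.
    public static long ReadTotalResults(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            string total = doc.RootElement
                .GetProperty("searchInformation")
                .GetProperty("totalResults")
                .GetString();
            return long.Parse(total, CultureInfo.InvariantCulture);
        }
    }

    public static void Main()
    {
        // Made-up, trimmed-down response body; a real one has many more fields.
        string sample = "{\"searchInformation\":{\"totalResults\":\"1230000\"}}";

        Console.WriteLine(BuildUrl("YOUR_KEY", "YOUR_CX", "web scraper"));
        Console.WriteLine(ReadTotalResults(sample)); // prints 1230000
    }
}
```

In a real program the JSON would come from an HTTP GET on the URL that `BuildUrl` returns, one call per search word.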

Here is the scraping code I used.

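    // Requires: using System; using System.Collections.Specialized; using System.Net;

    // Returns the text between the first strStart and the next strEnd after it,
    // or "" if either marker is missing.
    public static string getBetween(string strSource, string strStart, string strEnd)
    {
        int start = strSource.IndexOf(strStart, StringComparison.Ordinal);
        if (start < 0)
        {
            return "";
        }
        start += strStart.Length;

        // Look for the end marker only after the start marker, so a strEnd
        // that appears earlier in the text is ignored.
        int end = strSource.IndexOf(strEnd, start, StringComparison.Ordinal);
        return end < 0 ? "" : strSource.Substring(start, end - start);
    }

    public void SearchResult()
    {
        // Run a Google search
        string uriString = "http://www.google.com/search";
        string keywordString = "Search String";

        using (WebClient webClient = new WebClient())
        {
            // Put q=<keywordString> into the request's query string
            NameValueCollection nameValueCollection = new NameValueCollection();
            nameValueCollection.Add("q", keywordString);
            webClient.QueryString.Add(nameValueCollection);

            string result = webClient.DownloadString(uriString);

            // Narrow the page down to the text between "Address" and "Hours",
            // then take the first piece of element text inside that slice.
            string search = getBetween(result, "Address", "Hours");
            rtbHtml.Text = getBetween(search, "\">", "<");
        }
    }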

In my case I used the strings "Address" and "Hours" to limit which part of the page I extracted information from.
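The same marker approach could be pointed at the result count from the question. This is only a sketch: the HTML snippet is made up to resemble a "resultStats" div, the class and helper names are hypothetical, and the real page markup may differ or change, as noted above.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class ResultStatsDemo
{
    // Same idea as getBetween: the text between two markers, or "" if either is missing.
    public static string GetBetween(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0) return "";
        from += start.Length;
        int to = source.IndexOf(end, from, StringComparison.Ordinal);
        return to < 0 ? "" : source.Substring(from, to - from);
    }

    // Keeps only the ASCII digits, so "About 1,230,000 results" becomes 1230000.
    // Returns 0 when the text contains no digits at all.
    public static long ParseCount(string text)
    {
        string digits = new string(text.Where(c => c >= '0' && c <= '9').ToArray());
        return digits.Length == 0 ? 0 : long.Parse(digits, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Hypothetical markup, not copied from a live Google page.
        string html = "<div id=\"resultStats\">About 1,230,000 results<nobr> (0.42 seconds)</nobr></div>";

        // Stop at the first '<' so the timing in <nobr> is not counted.
        string stats = GetBetween(html, "id=\"resultStats\">", "<");
        Console.WriteLine(ParseCount(stats)); // prints 1230000
    }
}
```

Returning a number here means the per-word counts can be summed directly, which is what the question's WebScrape method expects from its filter methods.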

Edit: Fixed the logic and added the code I used. Edit 2: Forgot to add the getBetween method. (Sorry, it's my first answer.)

Upvotes: 2
