Khalil Khalaf

Reputation: 9397

How to extract data from a website by specifying search criteria?

I have a new project in an area I am not familiar with. One task is to navigate some websites and collect data from them. A sample website is: https://www.hudhomestore.com/Home/Index.aspx


I have read and watched tutorials on "collecting" data from a web page, such as:

But my question is: how do we usually set search preferences, run the search based on them, and then use the approach from those tutorials to load the results in my code?

EDIT

This is correct for setting the search criteria based on my selection. However, the total count for the search (if I do it manually for the state of MI) is 223, but if I execute the code below, tdNodeCollection contains only 121 nodes. Can you show me where I am going wrong?

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    string zipCode = "", city = "", county = "", street = "", sState = "MI", fromPrice = "0", toPrice = "0", fcaseNumber = "",
           bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
           stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

    var doc = await (Task.Factory.StartNew(() => web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage)));

    HtmlNodeCollection tdNodeCollection = doc
                             .DocumentNode
                             .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");

Upvotes: 2

Views: 1540

Answers (1)

M. Adeel Khalid

Reputation: 1796

You can use HtmlAgilityPack for this purpose. I wrote a small test and ran it against the second page (the search results page) that you want to scrape, passing the search criteria in the URL so that you can set them yourself.

        HtmlWeb web = new HtmlWeb();
        //string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
        //Set these variables to whatever the user enters as search criteria;
        //they are appended to the results-page URL below as query-string parameters
        string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
               bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
               stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
        HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
            "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState + 
            "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
            "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath + 
            "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities + 
            "&outdoorAmenities=" +outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories + 
            "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
        HtmlNodeCollection tdNodeCollection = document
                                 .DocumentNode
                                 .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
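As a side note, the long string concatenation above can also be written as a small helper that escapes each value, which matters once a criterion contains a space or an `&` (a city name, for example). This is only a sketch: `SearchUrlBuilder` and `Build` are names I made up for illustration, and the parameter names are the ones already used in the URL above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of HtmlAgilityPack): builds the search URL
// from key/value pairs instead of one long concatenation.
public static class SearchUrlBuilder
{
    // Results-page address taken from the code above.
    private const string BaseUrl =
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx";

    // Joins the criteria as "key=value&key=value", escaping every key and
    // value so that spaces or '&' inside a value cannot break the query string.
    public static string Build(IEnumerable<KeyValuePair<string, string>> criteria)
    {
        IEnumerable<string> pairs = criteria.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value ?? ""));
        return BaseUrl + "?" + string.Join("&", pairs);
    }
}
```

Passing the pairs `city` = `Ann Arbor` and `sState` = `MI`, in that order, gives a URL ending in `?city=Ann%20Arbor&sState=MI`, which you can hand to `web.Load(...)` exactly as before.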

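Count the nodes again and look at your XPath expression: it selects every td inside every tr under the element with id="dgPropertyList", and there are exactly 121 such td's on the page. Keep in mind that this number counts table cells, not properties. Next, inspect the td's manually, work out which ones hold the data you need, and fetch it from them.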

            //Note: SelectNodes returns null when nothing matches, so check tdNodeCollection first
            foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
            {
                //To reach elements such as <h2> or <p> inside the cell, you can do:
                HtmlNode h2Node = node.SelectSingleNode("./h2"); //First <h2> that is a direct child (null if none)
                HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); //Every <h2> at any depth (null if none)

                //You can also look at the children directly, without using XPath (like in a tree):
                HtmlNode h2Node_ = node.ChildNodes["h2"];
            }

I've tested the code; it works and parses the whole document to reach the required table. It gets you every cell of every row of that table inside the div, so you can dig further into the rows, find the td you need, and extract its data.
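If what you want to count is rows rather than cells, one way is to select the tr elements and read the td's of each row as a group. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; `RowReader` and `ReadRows` are hypothetical names, and I have not verified how the real page lays out its rows, so treat the `//tr[td]` filter (rows with at least one direct td child, which skips header rows made of th) as an assumption to adjust.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: groups table cells by row so that rows, not cells, are counted.
public static class RowReader
{
    // Returns one list of trimmed cell texts per <tr> that has direct <td> children.
    public static List<List<string>> ReadRows(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode
            .SelectNodes("//*[@id='" + tableId + "']//tr[td]");
        if (rows == null)
            return new List<List<string>>();

        // The [td] predicate above guarantees "./td" finds at least one node per row.
        return rows
            .Select(tr => tr.SelectNodes("./td")
                            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                            .ToList())
            .ToList();
    }
}
```

To use it with the page loaded above, pass `document.DocumentNode.OuterHtml` and `"dgPropertyList"`; the `Count` of the result is then the number of data rows rather than the number of cells. Note that if the table contains nested tables, their rows would be matched as well.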

Another option is to use Selenium WebDriver; see "Get your hands on Selenium".

If you don't want the browser to be visible but still want Selenium-like functionality, you can use PhantomJS.

Hope it helps.

Upvotes: 2
