Reputation: 9397
I have got this new project that I am not familiar in working with. One task is that I need to navigate some websites to collect some data. One sample website would be this: https://www.hudhomestore.com/Home/Index.aspx
I have read and watched tutorials on "collecting" data from a web page, such as:
But my question is: how do we usually set preferences to "search" based on our preferences, and then use the above links to load the results in my code?
EDIT
This is correct for setting the search criteria based on my selection. However, the total count of the search (if I do it manually for the MI state) is 223, but if I execute the code below, tdNodeCollection
only contains 121 nodes. Can you show me where I am going wrong?
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

string zipCode = "", city = "", county = "", street = "", sState = "MI", fromPrice = "0", toPrice = "0", fcaseNumber = "",
       bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
       stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

// Load the search-result page off the calling thread.
var doc = await Task.Factory.StartNew(() => web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
    "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
    "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
    "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
    "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
    "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
    "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage));

HtmlNodeCollection tdNodeCollection = doc
    .DocumentNode
    .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
Upvotes: 2
Views: 1540
Reputation: 1796
You can make use of HtmlAgilityPack for this purpose. I wrote a small test and ran it against the search-result page you want to scrape, with search criteria that you can set.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
//string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";

// Set the values of these variables to whatever the user inputs,
// then append them to the initial URL as query-string parameters.
string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
       bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
       stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
    "zipCode=" + zipCode + "&city=" + city + "&country=" + county + "&street=" + street + "&sState=" + sState +
    "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
    "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
    "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
    "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
    "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);

HtmlNodeCollection tdNodeCollection = document
    .DocumentNode
    .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
Count them again and look at your expression: there are exactly 121 td cells inside the tr rows of the element with id="dgPropertyList". Next, inspect a td manually, trace what you need from it, and fetch that data.
foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
{
    // Say you want to access an <h2> or <p> inside this <td>. You can do:
    HtmlNode h2Node = node.SelectSingleNode("./h2");            // gets the first direct <h2> child
    HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2");  // searches in depth too

    // You can also look at the children without XPath (like walking a tree):
    HtmlNode h2Node_ = node.ChildNodes["h2"];
}
I've tested the code; it works and parses the whole document down to the required table, giving you all the cells of the rows within that table. From there you can dig further into those rows, find your td, and get what you need.
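Just to illustrate the "dig into those rows" part (a sketch, not the tested code above): iterating row by row instead of cell by cell keeps the fields of one listing together. It reuses the document loaded above; the assumption that each tr under dgPropertyList is one listing, with its td cells holding the individual fields, is mine, so verify it against the real markup first.

using System;
using System.Linq;
using HtmlAgilityPack;

HtmlNodeCollection rows = document.DocumentNode
    .SelectNodes("//*[@id=\"dgPropertyList\"]//tr");

if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        HtmlNodeCollection cells = row.SelectNodes("./td");
        if (cells == null) continue; // skip rows without <td> cells (e.g. header/pager rows)

        // InnerText of each cell, trimmed; map these to your own fields (address, price, ...).
        var values = cells.Select(td => td.InnerText.Trim()).ToList();
        Console.WriteLine(string.Join(" | ", values));
    }
}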
Another option could be the Selenium WebDriver; get your hands on Selenium. If you don't want the browser to be visible but still want Selenium-like functionality, you can make use of PhantomJS.
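A minimal sketch of that route, using Selenium's C# bindings with headless Chrome in place of PhantomJS (PhantomJS is no longer maintained); the XPath is the same one used above, and the URL is shortened to just the state and language parameters:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless"); // no visible browser window

using (var driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl(
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?sState=MI&sLanguage=ENGLISH");

    // Same table and XPath as in the HtmlAgilityPack example.
    var cells = driver.FindElements(By.XPath("//*[@id='dgPropertyList']//tr//td"));
    Console.WriteLine(cells.Count + " cells found");
}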
Hope it helps.
Upvotes: 2