Reputation: 3910
I have an HTML file that contains many blocks like the following:
<div class="f-icon m-item " data-ctrdot="60055294621">
<div class="item-main util-clearfix">
<div class="content">
<div class="cwrap">
<div class="cleft">
<div class="lwrap">
<h2 class="title"><a href="http://www.alibaba.com/product-detail/Sunnytex-Best-Selling-wind-proof-Soft_60055294621.html?s=p" title="Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat" data-hislog="60055294621" data-pid="60055294621" data-domdot="id:2678,pid:60055294621,ext:'|n=2|s=p|t={{attr target}}'" target="_blank" data-p4plog="60055294621">Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat</a> </h2>
<div class="attr">
US $23.5-24.8 /
<em>Piece</em>
<em>( FOB Price)</em>
</div>
<div class="attr">
500 Pieces
<em>(Min. Order)</em>
</div>
<div class="kv-prop util-clearfix">
<div class="kv" title="Product Type: Coats">
Product Type:
<b>Coats</b>
</div>
<div class="kv" title="Age Group: Adults">
Age Group:
<b>Adults</b>
</div>
... (much more markup not shown here)
</div>
</div>
</div>
</div> (end)
I want to extract all the links like "http://www.alibaba.com/product-detail/Custom-3D-Made-Printed-Blank-Hoodies_60081368914.html?s=p".
I wrote:
Document doc = Jsoup.connect(catUrl).get();
Elements products = doc.select("div.f-icon m-item").select("h2.title").select("a[href]");
for(Element prodUrl: products){
System.out.println(prodUrl.html());
itemUrls.addItem(prodUrl.html());
}
So basically I want to put all the product page URLs into a HashSet called itemUrls, but products seems to be empty. Jsoup.connect(catUrl).get() works fine and returns the web page, but the select method doesn't seem to find anything. Any input will be greatly appreciated. Thanks.
Upvotes: 0
Views: 241
Reputation: 124225
A space in a selector describes an ancestor-descendant relationship, so div.f-icon m-item matches a div with the class f-icon and then looks for an m-item element inside it.
In other words, doc.select("div.f-icon m-item") is the same as doc.select("div.f-icon").select("m-item"), which can only find something like
<div class="f-icon">
...
<m-item>...</m-item>
...
</div>
which is not what you want.
If you want to select an element that has two classes, use the element.class1.class2 syntax.
So instead of
doc.select("div.f-icon m-item").select("h2.title").select("a[href]")
you can write it as
doc.select("div.f-icon.m-item h2.title a[href]")
//          ^^^^^^^^^^^^^^^^^ div with two classes "f-icon" and "m-item"
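To see the difference between the two selectors without hitting the network, here is a small sketch that parses an inline HTML string. The SelectorCheck class, the countMatches helper and the SAMPLE_HTML snippet are made up for illustration (a trimmed-down version of the markup in the question); only Jsoup.parse and select come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorCheck {
    // Hypothetical sample: one div carrying both classes, as in the question.
    static final String SAMPLE_HTML =
        "<div class=\"f-icon m-item \">"
      + "<h2 class=\"title\"><a href=\"http://example.com/p1.html\">Product 1</a></h2>"
      + "</div>";

    // Returns how many elements in the given HTML match the CSS query.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Space = descendant combinator: looks for an <m-item> element inside div.f-icon.
        System.out.println(countMatches(SAMPLE_HTML, "div.f-icon m-item"));
        // No space = one div that has both classes.
        System.out.println(countMatches(SAMPLE_HTML, "div.f-icon.m-item"));
    }
}
```

With this sample the first query should match nothing and the second should match the single div.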
Next, prodUrl.html() returns the inner HTML of the element, which for a link is its visible text, like foo in <a href="google.com">foo</a>.
What you seem to want is the value of the href attribute. To get it, use prodUrl.attr("href").
So your code can look more or less like
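Document doc = Jsoup.connect(catUrl).get();
// One div with both classes, then the link inside its h2.title
Elements products = doc.select("div.f-icon.m-item h2.title a[href]");
for (Element prodUrl : products) {
    // attr("href") gives the URL; html() would give the link text
    System.out.println(prodUrl.attr("href"));
    // addItem is kept from the question; a plain java.util.HashSet would use add(...)
    itemUrls.addItem(prodUrl.attr("href"));
}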
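If you want to try this offline, the following is a minimal sketch that runs the same selector against an inline HTML string and collects the hrefs into a set. The LinkExtractor class, its two helper methods and SAMPLE_HTML are assumptions made for the example, and it uses a java.util.LinkedHashSet with add(...) rather than whatever type itemUrls is in your code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExtractor {
    // Hypothetical sample, trimmed down from the markup in the question.
    static final String SAMPLE_HTML =
        "<div class=\"f-icon m-item \" data-ctrdot=\"60055294621\">"
      + "<h2 class=\"title\">"
      + "<a href=\"http://www.alibaba.com/product-detail/Sunnytex-Best-Selling-wind-proof-Soft_60055294621.html?s=p\">Sunnytex Coat</a>"
      + "</h2>"
      + "</div>";

    // Collects the href of every product link; the set drops duplicates.
    static Set<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Set<String> urls = new LinkedHashSet<>();
        for (Element link : doc.select("div.f-icon.m-item h2.title a[href]")) {
            urls.add(link.attr("href"));
        }
        return urls;
    }

    // Returns the inner HTML of the first product link, to compare with attr("href").
    static String firstLinkHtml(String html) {
        Element link = Jsoup.parse(html).select("div.f-icon.m-item h2.title a[href]").first();
        return link == null ? null : link.html();
    }

    public static void main(String[] args) {
        System.out.println(firstLinkHtml(SAMPLE_HTML)); // the link text
        System.out.println(extractLinks(SAMPLE_HTML));  // the URLs
    }
}
```

On a live page you would swap Jsoup.parse(html) for Jsoup.connect(catUrl).get(); the selector and the attr("href") call stay the same.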
Upvotes: 1