Reputation: 2231
I have the following HTML structure, which contains a list of email addresses. I want to grab only the business addresses, not the yahoo, gmail, hotmail, etc. ones.
<a href="#1">[email protected]</a>
<a href="#2">[email protected]</a>
<a href="#5">[email protected]</a>
<a href="#3">[email protected]</a>
<a href="#6">[email protected]</a>
<a href="#4">[email protected]</a>
So what I want is
[email protected]
[email protected]
My idea is:
select the A tags whose text does NOT contain ymail AND does NOT contain yahoo AND does NOT contain gmail AND does NOT contain hotmail.
But how can I write this idea in XPath syntax?
Upvotes: 1
Views: 4242
Reputation: 195962
You could use substring-after and substring-before to get the part after the @ and before the first ., combined with not and contains.
So substring-before(substring-after(text(),"@"),'.') would get the first part of the domain, and //a[not(contains("ymail yahoo gmail hotmail", ...))] would exclude the ones you don't want.
Altogether:
//a[not(contains("ymail yahoo gmail hotmail", substring-before(substring-after(text(),"@"),'.')))]
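To see what that predicate does step by step, here is a rough sketch that mirrors it in plain Python. It is not the XPath engine itself: as far as I know, the standard library's ElementTree only supports a small XPath subset without contains() or not(), so the helpers substring_after, substring_before and domain_label below are hand-written stand-ins for the XPath 1.0 functions. The addresses are made up, since the ones in the question are redacted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real addresses in the thread are redacted.
HTML = """<html>
<a href="#1">alice@yahoo.com</a>
<a href="#2">bob@acme.example</a>
<a href="#3">carol@gmail.com</a>
<a href="#4">dave@widgets.example</a>
</html>"""

# Same space-separated string the XPath expression searches in.
BLOCKED = "ymail yahoo gmail hotmail"

def substring_after(s, sep):
    # XPath 1.0 substring-after: text after the first sep, "" if sep is absent.
    return s.partition(sep)[2]

def substring_before(s, sep):
    # XPath 1.0 substring-before: text before the first sep, "" if sep is absent.
    head, found, _ = s.partition(sep)
    return head if found else ""

def domain_label(address):
    # Part after "@" and before the first following ".".
    return substring_before(substring_after(address, "@"), ".")

root = ET.fromstring(HTML)
# Keep an address only if its domain label is NOT a substring of BLOCKED.
kept = [a.text for a in root.iter("a")
        if domain_label(a.text or "") not in BLOCKED]
print(kept)
```

Because this is a substring test against one long string, any label that happens to appear inside it is excluded as well: an address at mail.com yields the label "mail", which is contained in "ymail". An address with no "@" or no "." after it yields an empty label, which is also treated as contained, so that link is dropped too.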
Upvotes: 3
Reputation: 111491
Your idea translates directly into XPath as follows:
//a[not(contains(., 'ymail')) and not(contains(., 'yahoo')) and not(contains(., 'gmail')) and not(contains(., 'hotmail'))]/text()
For your example (with a single root element added),
<html>
<a href="#1">[email protected]</a>
<a href="#2">[email protected]</a>
<a href="#5">[email protected]</a>
<a href="#3">[email protected]</a>
<a href="#6">[email protected]</a>
<a href="#4">[email protected]</a>
</html>
it selects
[email protected]
[email protected]
as requested.
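
For anyone who wants to check the logic without an XPath 1.0 engine at hand, the same four not(contains(...)) tests can be mirrored in plain Python. This is only a sketch under assumptions: the addresses are invented (the ones in the question are redacted), the names SAMPLE, PROVIDERS and is_business are made up for illustration, and the filtering is done in Python rather than by evaluating the XPath expression, because as far as I know the standard library's ElementTree does not support contains() or not().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data with a single root element, as in the answer.
SAMPLE = """<html>
<a href="#1">alice@yahoo.com</a>
<a href="#2">bob@acme.example</a>
<a href="#5">erin@hotmail.com</a>
<a href="#3">carol@gmail.com</a>
<a href="#6">frank@ymail.com</a>
<a href="#4">dave@widgets.example</a>
</html>"""

PROVIDERS = ("ymail", "yahoo", "gmail", "hotmail")

def is_business(a):
    # "".join(itertext()) plays the role of "." in the XPath predicate:
    # the string value of the whole element.
    text = "".join(a.itertext())
    # not(contains(., p1)) and not(contains(., p2)) and ...
    return not any(p in text for p in PROVIDERS)

root = ET.fromstring(SAMPLE)
business = [a.text for a in root.iter("a") if is_business(a)]
print(business)
```

Note that the test looks at the whole address, not just the domain, so an address whose local part contains one of the provider names (say, gmail.fan at a business domain) would be filtered out as well.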
Upvotes: 3