Dark Cyber

Reputation: 2231

XPath for all element text that does not contain certain values

I have the following HTML structure, which contains a list of email addresses. I want to grab only the business addresses, not the ones at yahoo, gmail, hotmail, etc.

<a href="#1">[email protected]</a>
<a href="#2">[email protected]</a>
<a href="#5">[email protected]</a>
<a href="#3">[email protected]</a>
<a href="#6">[email protected]</a>
<a href="#4">[email protected]</a>

So what I want is

[email protected]
[email protected]

My idea is

Get the A tags that do NOT contain ymail AND NOT contain yahoo AND NOT contain gmail AND NOT contain hotmail.

But how can I write this idea in XPath syntax?

Upvotes: 1

Views: 4242

Answers (2)

Gabriele Petrioli

Reputation: 195962

You could use substring-after and substring-before to get the part after the @ and before the first ., combined with not and contains.

So substring-before(substring-after(text(),"@"),'.') would get the first part of the domain, and //a[not(contains("ymail yahoo gmail hotmail", ...))] would exclude the providers you listed.

Altogether

//a[not(contains("ymail yahoo gmail hotmail", substring-before(substring-after(text(),"@"),'.')))]
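To see how this expression behaves, here is a minimal Java sketch that evaluates it with the JDK's built-in javax.xml.xpath API. The class name, the helper select, and the sample addresses are all hypothetical placeholders (they are not the addresses from the question), and a single root element is assumed so the input parses as XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomainLabelFilter {
    // Hypothetical sample data; these addresses are placeholders only.
    static final String SAMPLE =
        "<html>"
        + "<a href=\"#1\">ann@yahoo.com</a>"
        + "<a href=\"#2\">bob@acme.example</a>"
        + "<a href=\"#3\">cat@gmail.com</a>"
        + "<a href=\"#4\">dan@mail.example</a>"
        + "</html>";

    // The expression from the answer above, unchanged.
    static final String EXPR =
        "//a[not(contains(\"ymail yahoo gmail hotmail\", "
        + "substring-before(substring-after(text(),\"@\"),'.')))]";

    // Evaluates expr against xml and returns the text of each selected node.
    static List<String> select(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Prints [bob@acme.example]
        System.out.println(select(SAMPLE, EXPR));
    }
}
```

Note that dan@mail.example is dropped as well: its first domain label, mail, is a substring of ymail in the lookup string, so contains returns true for it. If that matters for your data, a per-provider test on the full label may be safer.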

Upvotes: 3

kjhughes

Reputation: 111491

Your idea translates directly into XPath as follows:

//a[not(contains(., 'ymail')) and not(contains(., 'yahoo')) and not(contains(., 'gmail')) and not(contains(., 'hotmail'))]/text()

For your example (with a single root element added),

<html>
 <a href="#1">[email protected]</a>
 <a href="#2">[email protected]</a>
 <a href="#5">[email protected]</a>
 <a href="#3">[email protected]</a>
 <a href="#6">[email protected]</a>
 <a href="#4">[email protected]</a>
</html>

it selects

[email protected]
[email protected]

as requested.
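If the list of providers changes often, the same expression can be generated rather than typed by hand. The following Java sketch builds it from a list and evaluates it with the JDK's javax.xml.xpath API. The class name, the helpers buildExpr and select, and the sample addresses are hypothetical placeholders, and buildExpr assumes the provider names contain no single quotes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExcludeProvidersXPath {
    // Hypothetical sample data; these addresses are placeholders only.
    static final String SAMPLE =
        "<html>"
        + " <a href=\"#1\">one@ymail.com</a>"
        + " <a href=\"#2\">two@yahoo.com</a>"
        + " <a href=\"#5\">three@acme.example</a>"
        + " <a href=\"#3\">four@gmail.com</a>"
        + " <a href=\"#6\">five@widgets.example</a>"
        + " <a href=\"#4\">six@hotmail.com</a>"
        + "</html>";

    // Joins one not(contains(., '...')) test per provider with "and".
    // Assumes provider names contain no single quotes.
    static String buildExpr(List<String> blocked) {
        return blocked.stream()
            .map(p -> "not(contains(., '" + p + "'))")
            .collect(Collectors.joining(" and ", "//a[", "]/text()"));
    }

    // Evaluates expr against xml and returns the text of each selected node.
    static List<String> select(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String expr = buildExpr(List.of("ymail", "yahoo", "gmail", "hotmail"));
        // Prints [three@acme.example, five@widgets.example]
        System.out.println(select(SAMPLE, expr));
    }
}
```

Because contains(., ...) tests the whole string value of the element, an address whose local part includes a provider name (for example gmailfan@acme.example) would also be excluded.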

Upvotes: 3
