How to get content of an HTML element using XPath without element id?

Question

I am trying to find an element using XPath and get the element's text value. Kindly bear with me and help me resolve the issue.

Visit Click here

1.

In

- I need to extract the paragraph text only up to “Further History” (i.e. stop at “Further History”, not including “Further History”).

2.

In

- Here I need to extract the paragraph text after “Further History” (not including “Further History”).

I am using the XPath expression below, which is not returning anything.

(//STRONG[not(contains(text(), 'Further History'))]/following-sibling::text() | //STRONG[not(contains(text(), 'Further History'))]/../following-sibling::p/text()) | //div[contains(@class, 'articlecontent')]

Mathias Müller · Accepted Answer

HTML might not be case-sensitive, but XML (and, consequently, XPath) is: "STRONG" is not the same as "strong", and in the HTML you linked to, there is only "strong".
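To see the case-sensitivity point in isolation, here is a minimal sketch using Python's standard-library ElementTree, which implements a subset of XPath. The fragment is a made-up stand-in for the linked page, not its real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the linked page (an assumption).
doc = ET.fromstring(
    "<div><p>Intro text</p><p><strong>Further History</strong></p></div>"
)

upper = doc.findall(".//STRONG")  # no element is named "STRONG"
lower = doc.findall(".//strong")  # matches the lowercase element

print(len(upper))     # 0
print(len(lower))     # 1
print(lower[0].text)  # Further History
```

The uppercase query fails silently: it returns an empty result rather than raising an error, which matches an expression that returns nothing.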

A useful XPath expression to retrieve the text you are interested in might be

//div[@class="medium-8 columns"]/p[following-sibling::p/strong]/text()

which means

//div                           select all `div` elements, anywhere in the document
[@class="medium-8 columns"]     but only if they have a `class` attribute whose value is 
                                equal to "medium-8 columns"
/p                              of those `div` elements select all `p` child elements
[following-sibling::p/strong]   but only if they have a following sibling `p` which has a
                                `strong` element as a child
/text()                         of the remaining `p` elements, select the text content

and which would return (individual results separated by ------):

Tim Bajarin is recognized as one of the leading industry
consultants, analysts and futurists, covering the field of
personal computers and consumer technology. Mr. Bajarin has
been with Creative Strategies since 1981 and has served as a
consultant to most of the leading hardware and software
vendors in the industry including IBM, Apple, Xerox, Hewlett
Packard/Compaq, Dell, AT&T, Microsoft, Polaroid, Lotus,
Epson, Toshiba and numerous others.
-----------------------
His articles and/or analysis have appeared in USA Today, Wall
Street Journal, The New York Times, Time and Newsweek
magazines, BusinessWeek and most of the leading business and
trade publications. He has appeared as a business analyst
commenting on the computer industry on all of the major
television networks and was a frequent guest on PBS’ The
Computer Chronicles.
-----------------------
Mr. Bajarin has been a columnist for US computer industry
publications such as PC Week and Computer Reseller News and
wrote for ABCNEWS.COM for two years and Mobile Computing for
10 years. His columns currently appear in Asia Computer
Weekly, Personal Computer World (UK), and Microscope (UK) as
well as Mobile Enterprise Magazine. His various columns and
analyses are syndicated in over 30 countries.

For your second case:

Here I need to extract the paragraph text after “Further History” (not including “Further History”)

just replace following-sibling with preceding-sibling in the path expression.
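Both cases can be tried out on a small document. The sketch below is only an illustration under assumptions: the markup is invented to mirror the structure described above, and `split_paragraphs` is a hypothetical helper name. Python's standard-library ElementTree has no `following-sibling` or `preceding-sibling` axis, so the sibling predicates are applied in plain Python. A full XPath 1.0 engine such as lxml could evaluate the two expressions directly instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the structure described above (an assumption).
SAMPLE = """
<html><body>
<div class="medium-8 columns">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p><strong>Further History</strong></p>
  <p>Third paragraph.</p>
</div>
</body></html>
""".strip()


def split_paragraphs(root):
    """Return (before, after): text of the <p> siblings on either side of a <p> holding <strong>."""
    before, after = [], []
    # //div[@class="medium-8 columns"] is within ElementTree's XPath subset.
    for div in root.findall(".//div[@class='medium-8 columns']"):
        paragraphs = div.findall("p")  # direct <p> children only
        # Positions of the <p> elements that have a <strong> child.
        marks = [i for i, p in enumerate(paragraphs) if p.find("strong") is not None]
        if not marks:
            continue
        for i, p in enumerate(paragraphs):
            # Leading text of the paragraph; empty or whitespace-only text is skipped.
            text = (p.text or "").strip()
            if not text:
                continue
            if i < marks[-1]:  # emulates p[following-sibling::p/strong]
                before.append(text)
            if i > marks[0]:   # emulates p[preceding-sibling::p/strong]
                after.append(text)
    return before, after


before, after = split_paragraphs(ET.fromstring(SAMPLE))
print(before)  # ['First paragraph.', 'Second paragraph.']
print(after)   # ['Third paragraph.']
```

Note that `p.text` only covers the text before a paragraph's first child element, which is enough here because the sample paragraphs contain plain text.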
