XPATH to use preceding and following sibling in a single statement

Question

I would like to scrape the name and address information that sits between the tag containing the text "Defendant" and the next heading tag.

My HTML structure is:


Defendant/Respondent Information
(Each Defendant/Respondent is displayed below)

Party Type: Defendant   Party No.: 1
Name: Name 1
Address: Addr 1
City: city1   State: aa   Zip Code: Zip1

Party Type: Defendant   Party No.: 2
Name: Name 2
Address: Addr2
City: City2   State: st2   Zip Code: zip2

Related Persons Information
(Each Related person is displayed below)

Name: Unwanted Name
Address: un addr
City: Unwanted City   State: Unwanted city   Zip Code: 12345

My current XPath captures the first occurrence of the name and address correctly, but when I need to extract multiple occurrences, it also scrapes the information that follows the unwanted h5 tags.

My current XPath is:

"//*[contains(text(),'Defendant')]//following-sibling::table//span[text()='Name:' or text()='Business or Organization Name:']/ancestor-or-self::td/following-sibling::td//text()"

I tried combining preceding-sibling and following-sibling, but nothing gives my expected output.

My current output is:

names - [
Name1,
Name2,
Unwanted Name,
]

Expected output is:

[
Name1,
Name2,
]

Kindly help.

Siebe Jongebloed · Accepted Answer

Try this:

"//H5[contains(text(),'Defendant')]/following-sibling::table[not(preceding-sibling::H5[not(contains(text(),'Defendant'))])]/tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"

It first selects every table that has no preceding-sibling::h5 whose text() does not contain 'Defendant'. Then, within each of those tables, it selects the tr whose first td meets your requirements and returns the text of the second td.

There is no need for double slashes where a single step will do: // searches all descendants rather than direct children, which is bad for performance.
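To make the difference concrete, here is a minimal sketch in Python using lxml. The original HTML is not shown in the thread, so the markup below (h5 headings followed by sibling tables whose rows hold two td cells with span children) is an assumption, as are the names PAGE, ROW, UNFILTERED and FILTERED. The expressions use lowercase h5 because lxml's HTML parser lowercases element names.

```python
from lxml import html

# Hypothetical markup: the real page structure is assumed, not known.
PAGE = """<html><body>
<h5>Defendant/Respondent Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Name 1</span></td></tr></table>
<table><tr><td><span>Name:</span></td><td><span>Name 2</span></td></tr></table>
<h5>Related Persons Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</body></html>"""

# Row filter shared by both expressions: first cell is "Name:", return second cell.
ROW = "/tr[td[1][span[text()[.='Name:']]]]/td[2]/span/text()"

# following-sibling alone reaches every later table, including unwanted ones.
UNFILTERED = "//h5[contains(text(),'Defendant')]/following-sibling::table" + ROW

# Adding the predicate drops tables that come after any non-Defendant h5.
FILTERED = (
    "//h5[contains(text(),'Defendant')]/following-sibling::table"
    "[not(preceding-sibling::h5[not(contains(text(),'Defendant'))])]" + ROW
)

tree = html.fromstring(PAGE)
unfiltered_names = [str(n) for n in tree.xpath(UNFILTERED)]
filtered_names = [str(n) for n in tree.xpath(FILTERED)]

print(unfiltered_names)  # ['Name 1', 'Name 2', 'Unwanted Name']
print(filtered_names)    # ['Name 1', 'Name 2']
```

Note that this predicate only works when no other h5 precedes the Defendant heading among the same siblings; any earlier non-Defendant h5 would exclude every table.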

EDIT 1

Since the real page has more preceding-sibling h5 elements than the example shows, this XPath deals with that:

"//H5[contains(text(),'Defendant')]/following-sibling::table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"

This selects only those tables whose first (that is, nearest) preceding-sibling::h5 is the h5 we are interested in.

EDIT 2

Actually, the initial h5 selection is now redundant. This XPath will do:

"//table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
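As a rough illustration of the final expression, the sketch below applies it with lxml in Python and collects both the name and the address per table. The markup is an assumption (the thread does not show the real HTML), including the extra "Case Information" heading standing in for the additional h5 elements mentioned above; PAGE, TABLES, field and defendants are hypothetical names. Lowercase h5 is used because lxml's HTML parser lowercases element names.

```python
from lxml import html

# Hypothetical markup: structure and the "Case Information" block are assumed.
PAGE = """<html><body>
<h5>Case Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Earlier Name</span></td></tr></table>
<h5>Defendant/Respondent Information</h5>
<table>
  <tr><td><span>Name:</span></td><td><span>Name 1</span></td></tr>
  <tr><td><span>Address:</span></td><td><span>Addr 1</span></td></tr>
</table>
<table>
  <tr><td><span>Name:</span></td><td><span>Name 2</span></td></tr>
  <tr><td><span>Address:</span></td><td><span>Addr2</span></td></tr>
</table>
<h5>Related Persons Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</body></html>"""

# preceding-sibling is a reverse axis, so h5[1] is the nearest h5 above the table.
TABLES = "//table[preceding-sibling::h5[1][contains(text(),'Defendant')]]"


def field(table, label):
    """Return the second-cell text of the row whose first cell equals label."""
    hits = table.xpath(
        ".//tr[td[1][span[text()[.=$label]]]]/td[2]/span/text()", label=label
    )
    return str(hits[0]).strip() if hits else None


def defendants(markup):
    """Collect one record per table that sits under the Defendant heading."""
    tree = html.fromstring(markup)
    return [
        {"name": field(t, "Name:"), "address": field(t, "Address:")}
        for t in tree.xpath(TABLES)
    ]


records = defendants(PAGE)
print(records)
```

Selecting the tables first and then querying each one relatively keeps a defendant's name and address paired, which a single flat text() query would not guarantee.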
