user8298092

Extracting particular text

I am trying to extract all links to videos on a particular WordPress website. Each page has only one video.

Inside each page crawled, there is the following code:

<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" scrolling="no" width="660" height="410" >
</iframe></p>

I would like to extract the video link from this iframe, i.e. the value of its src attribute.

Google Chrome Inspector tells me that this can be addressed as:

But each webpage I am crawling has a different "post" number. They are quite random, hence I cannot easily use the aforementioned selectors.

Upvotes: 0

Views: 47

Answers (1)

alecxe

Reputation: 473763

If there is a dynamic part inside the id attribute, you can address it by partial-matching:

[id^=post] > div > p > iframe

where ^= means "starts with".

XPath alternative:

//*[starts-with(@id, "post")]/div/p/iframe

You may also be able to skip the intermediate div and p elements altogether and match the iframe as a descendant (CSS first, then XPath):

[id^=post] iframe
//*[starts-with(@id, "post")]//iframe

You can additionally check the iframe name:

[id^=post] iframe[name=vooplayerframe]
//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]
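If you want to see what these selectors do without depending on a particular scraping library, here is a minimal sketch using only Python's standard html.parser. It mimics `[id^=post] iframe[name=vooplayerframe]` by tracking open elements and collecting the src of matching iframes. The class name PostIframeFinder, the wrapper id "post-4821" and the sidebar block are made up for illustration; the real page structure may differ.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PostIframeFinder(HTMLParser):
    """Collect src of iframes with a given name that are nested under
    any element whose id starts with a given prefix."""

    def __init__(self, id_prefix="post", frame_name="vooplayerframe"):
        super().__init__()
        self.id_prefix = id_prefix
        self.frame_name = frame_name
        self.stack = []    # (tag, id_matches_prefix) for each open element
        self.sources = []  # collected src values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Descendant match: some ancestor must have an id starting with the prefix.
        inside_post = any(flag for _, flag in self.stack)
        if tag == "iframe" and inside_post and attrs.get("name") == self.frame_name:
            self.sources.append(attrs.get("src"))
        if tag not in VOID_TAGS:
            matches = (attrs.get("id") or "").startswith(self.id_prefix)
            self.stack.append((tag, matches))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical page: the "post-4821" wrapper and the sidebar are assumptions.
sample = """
<article id="post-4821">
  <div>
    <p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
    <iframe id="" name="vooplayerframe" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1"></iframe></p>
  </div>
</article>
<div id="sidebar"><iframe name="vooplayerframe" src="//example.invalid/ignored"></iframe></div>
"""

finder = PostIframeFinder()
finder.feed(sample)
print(finder.sources)
```

The iframe in the sidebar is skipped because none of its ancestors has an id starting with "post". In a real crawler you would normally just pass one of the CSS or XPath expressions above to your library's selector API instead of writing a parser by hand.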

Upvotes: 2
