Reputation:
I am trying to extract all links to videos on a particular WordPress website. Each page has only one video.
Inside each page crawled, there is the following code:
<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" allowtransparency="true" allowfullscreen="true" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" scrolling="no" width="660" height="410" >
</iframe></p>
I would like to extract the video link from this iframe (the value of its src attribute).
Google Chrome Inspector tells me that this can be addressed as:
//*[@id="post-255"]/div/p/iframe
#post-255 > div > p > iframe
But each webpage I am crawling has a different "post" number. They are quite random, hence I cannot easily use the aforementioned selectors.
Upvotes: 0
Views: 47
Reputation: 473763
If there is a dynamic part inside the id attribute, you can address it by partial matching:
[id^=post] > div > p > iframe
where ^= means "starts with".
XPath alternative:
//*[starts-with(@id, "post")]/div/p/iframe
Also see whether you can skip the intermediate div and p elements altogether and use:
[id^=post] iframe
//*[starts-with(@id, "post")]//iframe
You can additionally check the iframe name:
[id^=post] iframe[name=vooplayerframe]
//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]
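To see what the last selector matches, here is a minimal sketch that uses only Python's standard library and reproduces the same logic by hand: an id that starts with "post" on some ancestor, then a descendant iframe named vooplayerframe. The names VideoFrameParser and extract_video_sources are hypothetical, and the article/div wrapper in the sample markup is an assumption based on the Chrome selector in the question. It also assumes reasonably well-balanced markup, since it tracks nesting by counting open and close tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VideoFrameParser(HTMLParser):
    """Hand-rolled equivalent of: [id^=post] iframe[name=vooplayerframe]"""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open elements
        self.post_depth = None  # depth at which the enclosing "post" element opened
        self.sources = []       # collected src values

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        self.depth += 1
        if self.post_depth is None:
            # Partial match on the id attribute, like [id^=post].
            if (attrs.get("id") or "").startswith("post"):
                self.post_depth = self.depth
            return
        # Inside a post element: any descendant iframe with the right name.
        if tag == "iframe" and attrs.get("name") == "vooplayerframe":
            self.sources.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.post_depth is not None and self.depth == self.post_depth:
            self.post_depth = None  # the post element itself just closed
        self.depth -= 1


def extract_video_sources(html):
    """Return the src of every matching iframe in the given HTML string."""
    parser = VideoFrameParser()
    parser.feed(html)
    parser.close()
    return parser.sources


# Assumed page structure; only the <p> block comes from the question.
SAMPLE_HTML = """
<article id="post-255"><div>
<p><script src="https://www.vooplayer.com/v3/watch/video.js"></script>
<iframe id="" voo-auto-adj="true" name="vooplayerframe" style="max-width:100%" src="//www.vooplayer.com/v3/watch/watch.php?v=123456;clearVars=1" frameborder="0" width="660" height="410" >
</iframe></p>
</div></article>
"""

if __name__ == "__main__":
    print(extract_video_sources(SAMPLE_HTML))
```

The src value comes back protocol-relative (it starts with //), so it would still need a scheme before it can be requested. If your crawler already supports CSS or XPath selectors, passing the selectors above to it directly is simpler than this hand-written matching.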
Upvotes: 2