Reputation: 118271
I am working with a third-party application where I can only view the webpage source. From it, I need to collect only the href values that match a pattern like /aems/file/filegetrevision.do?fileEntityId. Is that possible?
HTML *(Part of HTML)*
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>
Upvotes: 0
Views: 89
Reputation: 3742
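require 'open-uri'

source = 'http://www.example.com'
page = URI.open(source).read
# URI.extract only finds absolute URIs and takes a list of schemes, not a regexp,
# so scan the page text for the relative path instead.
page.scan(%r{/aems/file/filegetrevision\.do\?fileEntityId=[^"'\s>]*})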
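To try the regular-expression approach without a network call, here is a minimal sketch that works on a string holding the page source. It assumes the href values are double-quoted, as in the question's snippet. The helper name `hrefs_with_prefix` is made up for illustration, and the sample markup is copied from the question.

```ruby
# Sample markup copied from the question; in practice this would be the fetched page source.
page = <<~HTML
  <td width="50%">
  <a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
  screenshot.doc
  </a>
  </td>
HTML

# Hypothetical helper: returns every double-quoted href value that starts with the given prefix.
def hrefs_with_prefix(source, prefix)
  # scan with one capture group yields an array of one-element arrays, hence flatten.
  source.scan(/href="([^"]*)"/).flatten.select { |href| href.start_with?(prefix) }
end

links = hrefs_with_prefix(page, '/aems/file/filegetrevision.do?fileEntityId')
puts links
```

A regex like this is fine for quick checks on predictable markup, but it will miss single-quoted or unquoted attributes. For anything less regular, an HTML parser is the safer choice.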
Upvotes: 1
Reputation: 160551
Easily:
require 'nokogiri'
html = '
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>
'
doc = Nokogiri::HTML(html)
doc.search('a[href]').map{ |a| a['href'] }
Which returns:
[
[0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]
If you want to filter for path matches, use something like:
pattern = Regexp.escape('/aems/file/filegetrevision.do?fileEntityId')
doc.search('a[href]').map{ |a| a['href'] }.select{ |href| href[ %r[^#{ pattern }] ] }
Which, again, returns:
[
[0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]
This code returns the href attribute of every <a> tag in the document that has one. The second example additionally keeps only those whose path starts with the given pattern.
Upvotes: 2