Arup Rakshit

Reputation: 118271

Searching a web page for href values using Ruby

I am working on a third-party application where I can only view the web page's source. From that source I need to collect only the href values that match a pattern like /aems/file/filegetrevision.do?fileEntityId. Is this possible?

HTML *(Part of HTML)*

<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>

Upvotes: 0

Views: 89

Answers (2)

alex

Reputation: 3742

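require 'open-uri'

source = 'http://www.example.com'
page = URI.open(source).read
# URI.extract only returns absolute URIs, so scan the HTML for the relative hrefs instead.
pattern = Regexp.escape('/aems/file/filegetrevision.do?fileEntityId=')
page.scan(/href="(#{pattern}[^"]*)"/).flatten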

Upvotes: 1

the Tin Man

Reputation: 160551

Easily:

require 'nokogiri'

html = '
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>
'

doc = Nokogiri::HTML(html)
doc.search('a[href]').map{ |a| a['href'] }

Which returns:

[
    [0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]

If you want to filter for path matches, use something like:

pattern = Regexp.escape('/aems/file/filegetrevision.do?fileEntityId')
doc.search('a[href]').map{ |a| a['href'] }.select{ |href| href[ %r[^#{ pattern }] ] }

Which, again, returns:

[
    [0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]

The first example returns the href attribute of every <a> tag in the document that has one. The second example keeps only the values whose path starts with the given pattern.
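To see why the pattern is passed through Regexp.escape before being interpolated, here is a small standard-library-only sketch. The second and third href strings and the variable names are made up for illustration; only the first href comes from the question. Without escaping, "." matches any character and "o?" means an optional "o", so the regex no longer describes the literal path.

```ruby
# Sample href values. Only the first one is from the question;
# the other two are hypothetical and exist to show the difference.
hrefs = [
  '/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz',
  '/aems/file/filelist.do?folderId=42',
  '/aems/file/filegetrevisionXdfileEntityId=1'
]

prefix = '/aems/file/filegetrevision.do?fileEntityId'

# Unescaped: "." and "?" keep their regex meaning.
unescaped = Regexp.new("^#{prefix}")
# Escaped: every character of the prefix is matched literally.
escaped = Regexp.new("^#{Regexp.escape(prefix)}")

loose_matches  = hrefs.grep(unescaped) # misses the real link, accepts the bogus one
strict_matches = hrefs.grep(escaped)   # keeps only the real link
```

When only a literal prefix is needed, href.start_with?(prefix) selects the same values as the escaped pattern without building a regex at all.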

Upvotes: 2
