HREF values search through the web page using Ruby

Question

I am working with a third-party application where I can only view the web page's source. From that source I need to collect only the href values that match a pattern like /aems/file/filegetrevision.do?fileEntityId. Is it possible?

HTML *(Part of HTML)*



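<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">screenshot.doc</a>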

the Tin Man · Accepted Answer

Easily:

require 'nokogiri'

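html = '
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">screenshot.doc</a>
'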

doc = Nokogiri::HTML(html)
doc.search('a[href]').map{ |a| a['href'] }

Which returns:

[
    [0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]

If you want to filter for path matches, use something like:

pattern = Regexp.escape('/aems/file/filegetrevision.do?fileEntityId')
doc.search('a[href]').map{ |a| a['href'] }.select{ |href| href[ %r[^#{ pattern }] ] }

Which, again, returns:

[
  [0] "/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz"
]

The first example returns the value of the href attribute from every <a> tag in the document that has one. The second example then keeps only the values that start with the given path.
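Since the filter only checks whether each href begins with a literal string, the same step could probably be written without a regular expression. The following is a minimal sketch using only Ruby's standard String#start_with?; the method name matching_hrefs and the sample array are hypothetical stand-ins for the hrefs that Nokogiri would return, not part of the original answer.

```ruby
# Hypothetical helper: keep only the hrefs that begin with a literal prefix.
# No regex escaping is needed because start_with? compares plain strings.
def matching_hrefs(hrefs, prefix)
  hrefs.select { |href| href.start_with?(prefix) }
end

# Sample data (an assumption) standing in for
# doc.search('a[href]').map { |a| a['href'] }
hrefs = [
  '/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz',
  '/aems/other/page.do?id=1',
  '/mirror/aems/file/filegetrevision.do?fileEntityId=2'
]

prefix = '/aems/file/filegetrevision.do?fileEntityId'

puts matching_hrefs(hrefs, prefix)
```

Only the first entry is kept: the second has a different path, and the third contains the prefix but does not start with it, which mirrors the ^ anchor in the regex version.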
