I would like help in parsing text in Ruby.
Given:
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3
I would like to eliminate all the hyperlinks, returning plain text.
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
Upvotes: 2
Views: 1373
Reputation: 160551
This is an old, but good, question. Here's an answer that relies on Ruby's built-in URI:
require 'uri'
text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
URI.extract(text).each do |url|
text.gsub!(url, '') if (url[schemes_regex])
end
puts text.squeeze(' ')
And a pass through IRB showing what's happening and the resulting output:
I defined the text to search:
irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
I defined a regex of URI schemes we want to react to. This is a defensive move because URI returns a false-positive in its search step:
irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i
Let URI walk through the text finding URLs. For each one found, if it's a scheme we want to react to, strip all its occurrences from the text:
irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1* text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end
These are the URLs URI.extract found. It erroneously reports "BreakingNews:" because the trailing ":" makes the word look like a URI scheme. The matcher isn't very sophisticated, but it's fine for normal use:
=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]
Show what the resulting text was:
irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
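The false positive can also be avoided at the extraction step, because URI.extract takes an optional list of schemes as its second argument. The following is a minimal sketch, not the code from the walkthrough above: strip_urls is a hypothetical helper name, and limiting the schemes to http and https is an assumption about which links should be removed. Note that URI.extract is marked obsolete in recent Ruby versions, so treat this as illustrative.

```ruby
require 'uri'

# Hypothetical helper: removes every URL whose scheme is in `schemes`.
# Restricting the schemes keeps URI.extract from reporting "BreakingNews:".
def strip_urls(text, schemes = %w[http https])
  cleaned = URI.extract(text, schemes).reduce(text) do |result, url|
    result.gsub(url, '') # string pattern, so the URL is matched literally
  end
  # Collapse the doubled spaces left behind and trim the ends.
  cleaned.squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_urls(tweet)
```

Unlike the gsub! version, this sketch returns a new string and leaves the original text untouched.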
Upvotes: 1
Reputation: 116
It can be done in a quick-and-dirty way or in a sophisticated way. I am showing the sophisticated way:
require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'
## first, get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/)
## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text
Try to understand what the code is doing at each step, and apply that knowledge in your future projects. These pages may help:
http://wiki.github.com/why/hpricot/an-hpricot-showcase
http://code.whytheluckystiff.net/doc/hpricot/
Upvotes: -1
Reputation: 29303
foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
r = foo.gsub(%r{http://[\w.:/]+}, '')
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
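The pattern above only matches http:// links made of word characters, dots, colons, and slashes. A slightly more general sketch of the same regex idea is shown below; LINK_PATTERN and remove_links are hypothetical names, and treating a URL as "everything up to the next whitespace" is a simplifying assumption that will also swallow trailing punctuation such as a final period.

```ruby
# Assumption: a link is http:// or https:// followed by non-whitespace.
# The leading \s* also removes the whitespace in front of each link.
LINK_PATTERN = %r{\s*https?://\S+}i

# Hypothetical helper: returns a copy of the text with links removed.
def remove_links(text)
  text.gsub(LINK_PATTERN, '').strip
end

foo = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts remove_links(foo)
```

Matching the preceding whitespace together with the link means no separate squeeze step is needed when a link sits in the middle of a sentence.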
Upvotes: 1