How do I remove URLs from text?

I would like help in parsing text in Ruby.

Given:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I would like to eliminate all the hyperlinks, returning plain text.

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

Upvotes: 2

Answers (3)

the Tin Man

Reputation: 160551

This is an old, but good, question. Here's an answer that relies on Ruby's built-in URI:

require 'set'
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

URI.extract(text).each do |url|
  text.gsub!(url, '') if (url[schemes_regex])
end

puts text.squeeze(' ')

Here is a pass through IRB showing what's happening and the resulting output.

I defined the text to search:

irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

I defined a regex of the URI schemes we want to react to. This is a defensive move because URI.extract returns a false positive in its search step:

irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i

Let URI walk through the text finding URLs. For each one found, if it's a scheme we want to react to, strip all its occurrences from the text:

irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end

These are the URLs URI.extract found. It erroneously reports BreakingNews: because of the trailing :. URI.extract isn't very sophisticated, but for normal use it's fine:

=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]

Show what the resulting text was:

irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
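
The scheme check can also be pushed into the extraction step, since URI.extract accepts an optional list of schemes as its second argument. What follows is a minimal sketch, assuming only http and https links should be removed. The name strip_urls is a hypothetical helper, not part of URI, and URI.extract is flagged as obsolete in recent Ruby versions, so treat this as an illustration rather than a definitive approach:

```ruby
require 'uri'

# Hypothetical helper (not part of URI): removes every URL that
# URI.extract finds for the given schemes.
def strip_urls(text, schemes = %w[http https])
  cleaned = text.dup
  # Restricting the schemes keeps "BreakingNews:" from being reported.
  URI.extract(text, schemes).each do |url|
    cleaned.gsub!(url, '')
  end
  # Collapse the runs of spaces left behind and trim both ends.
  cleaned.squeeze(' ').strip
end

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_urls(text)
```

Working on a copy leaves the caller's string untouched, unlike the gsub! loop in the walk-through.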

Upvotes: 1

vulcan_hacker

Reputation: 116

It can be done in a quick and dirty way or in a sophisticated way. I am showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## first, get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/) 

## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text

Try to understand what the code is doing at each step, and apply that knowledge in your future projects. These pages may help:

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/

Upvotes: -1

hobodave

Reputation: 29303

foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
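
The pattern above only matches http:// links and stops at the first character outside its set (the ? of a query string, for example). Here is a looser sketch, assuming any run of non-whitespace after http:// or https:// counts as the URL. The name remove_links is hypothetical, and punctuation attached to the end of a link would be removed along with it:

```ruby
# Hypothetical helper: drops http/https links together with the
# whitespace in front of them, then trims the ends of the string.
def remove_links(text)
  text.gsub(%r{\s*https?://\S+}, '').strip
end

foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
puts remove_links(foo)
```

Consuming the leading whitespace avoids the stray space that a plain substitution leaves where the link used to be.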

Upvotes: 1
