Reputation: 348

Ruby - Strip all HTML tags from string with Regex

I have the following string as an example

"<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"

And I would like to strip all HTML tags from it. I was using the following method which kind of worked

Nokogiri::HTML(CGI.unescapeHTML(@message_preview)).content

But it ultimately produced,

"Hello,my name is SameFarewell,Same"

When I wanted

"Hello, my name is Same Farewell, Same"

Notice the spaces: wherever there was a line break between paragraphs, I want a space in its place, rather than having the next word follow immediately.

I was hoping to try to use gsub or regex but am kind of lost on how to make it happen.

Upvotes: 1

Answers (3)

Viktor Ivliiev

Reputation: 1334

My solution:

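# gsub rather than gsub!: gsub! returns nil when nothing matches, so .strip would raise
description = description.gsub(/<("[^"]*"|'[^']*'|[^'">])*>/, ' ').squeeze(' ').strip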
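To show what the quoted-string alternatives in that pattern are for, here is a minimal sketch. The method name strip_tags is made up for illustration, and the regex is the same one with a non-capturing group. A simpler pattern such as /<.*?>/ would stop at the first > inside an attribute value, while this one skips over quoted values.

```ruby
# Hypothetical helper; only String#gsub and String#strip from core Ruby are used.
def strip_tags(html)
  # A tag is "<", then any mix of quoted strings and non-quote/non-">" characters, then ">".
  tag = /<(?:"[^"]*"|'[^']*'|[^'">])*>/
  html.gsub(tag, ' ')     # every tag becomes a space
      .gsub(/\s+/, ' ')   # collapse the runs of whitespace left by adjacent tags
      .strip              # drop leading and trailing whitespace
end

puts strip_tags("<p>Hello,</p><p><br></p><p>my name is Same</p>")
# prints: Hello, my name is Same

puts strip_tags('<a title="a > b">link</a> text')
# prints: link text
```

This is still a regex working on markup, so treat it as good enough for simple, well-formed fragments rather than arbitrary HTML.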

Upvotes: 0

Sagar Pandya

Reputation: 9497

You can use split with a regex that matches the tags; this works for your example (s is your string).

def wordy(s)
  s.split(/<.*?>/)
   .map(&:strip)
   .reject(&:empty?)
   .join(' ')
   .gsub(/\s,/, ',')
end

s = "<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
t = "<p>Hello <strong>Jim</strong>,</p><p> </p><p>This is <em>Charlie</em> and<u> I wanted to say</u></p><ol><li>hello</li><li>goodby</li></ol><p> </p><p>Farewell,</p><p>Lawrence</p>"

p wordy s
#"Hello, my name is Same Farewell, Same"

p wordy t
#"Hello Jim, This is Charlie and I wanted to say hello goodby Farewell, Lawrence"
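If it helps to see why the final gsub is there, this rough sketch exposes each intermediate stage on a shortened input. The helper name wordy_stages is hypothetical and exists only for this illustration; the steps are the same as in wordy.

```ruby
# Hypothetical helper that returns every intermediate stage of the pipeline.
def wordy_stages(s)
  parts  = s.split(/<.*?>/)                     # text between tags, blank pieces included
  words  = parts.map(&:strip).reject(&:empty?)  # trim each piece, drop the blank ones
  joined = words.join(' ')                      # one space between pieces
  fixed  = joined.gsub(/\s,/, ',')              # remove the space that ends up before a comma
  { parts: parts, words: words, joined: joined, fixed: fixed }
end

stages = wordy_stages("<p>Hello <strong>Jim</strong>,</p><p> </p><p>Farewell</p>")
p stages[:parts]   # ["", "Hello ", "Jim", ",", "", " ", "", "Farewell"]
p stages[:words]   # ["Hello", "Jim", ",", "Farewell"]
p stages[:joined]  # "Hello Jim , Farewell"
p stages[:fixed]   # "Hello Jim, Farewell"
```

The comma lands in its own piece because it sits between two tags, so joining with spaces puts a space in front of it; the last step removes that space.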

Upvotes: 2

Aleksei Matiushkin

Reputation: 121000

Unfortunately, Nokogiri::XML::Node#traverse does not return an enumerator when no block is given, which is why we need the ugly hack of defining a local variable up front.

require 'cgi'
require 'nokogiri'

result, input = [], "<p>Hello,</p><p><br></p><p>my name is Same</p>" \
                    "<p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
Nokogiri::HTML(CGI.unescapeHTML(input)).traverse do |e|
  result << e.text if e.text?
end
result.join(' ')
#⇒ "Hello, my name is Same Farewell, Same"

Upvotes: 2
