Aaron Yodaiken

Reputation: 19551

Remove whitespace from an HTML document using Ruby

I have a string in Ruby that looks something like this:

str = "<html>\n<head>\n\n  <title>My Page</title>\n\n\n</head>\n\n<body>" +
      "  <h1>My Page</h1>\n\n<div id=\"pageContent\">\n  <p>Here is a para" +
      "graph. It can contain  spaces that should not be removed.\n\nBut\n" +
      "line breaks that should be removed.</p></body></html>"

How would I remove all whitespace (spaces, tabs, and line breaks) that is outside of a tag, i.e. not inside a content-bearing tag such as <p>, using only native Ruby?

(I'd like to avoid using XSLT or something for a task this simple.)

Upvotes: 4

Views: 4127

Answers (4)

phil pirozhkov

Reputation: 4900

xml.squish.gsub(/> </, '><')

Even shorter than the answers above. Note that String#squish comes from ActiveSupport (Rails), not core Ruby.

PS I love the funny faces.
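
Since squish is an ActiveSupport method, here is a rough plain-Ruby sketch of the same idea. The helper name squish_plain is made up for illustration, and it only approximates what squish does (trim the ends, collapse each run of whitespace to one space).

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish, using core Ruby only:
# trim both ends, then collapse every run of whitespace into a single space.
def squish_plain(text)
  text.strip.gsub(/\s+/, " ")
end

xml = "<div>\n  <p>Some   text</p>\n\n</div>\n"

# After squishing, the only whitespace left between tags is a single space.
compact = squish_plain(xml).gsub("> <", "><")
puts compact  # => <div><p>Some text</p></div>
```

Like squish itself, this also collapses runs of spaces inside text content, which may or may not be what you want.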

Upvotes: 0

user1158559

Reputation: 1954

Hate to split hairs about regexen, but none of the other answers are strictly correct. This will work:

str.gsub(/>\s*/, ">").gsub(/\s*</, "<")

Explicitly converting newlines is unnecessary because \s matches every whitespace character, including newline. The regexen in the other answers are not strictly correct because they fail to match "\r", which ends lines on Windows (as "\r\n") and will appear in emails.

My line will also convert <p> foo bar </p> into <p>foo bar</p>, but you may not want this.
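
A small sketch of that behaviour on a made-up sample with Windows line endings. The second variant, which only touches whitespace sitting directly between two tags, is an alternative I am adding for comparison in case you want to keep the padding inside <p>.

```ruby
# Made-up sample: CRLF line endings plus padded text inside <p>.
sample = "<div>\r\n  <p> foo bar </p>\r\n</div>"

# Strip whitespace after every ">" and before every "<".
# This also trims the padding around the text inside <p>.
aggressive = sample.gsub(/>\s*/, ">").gsub(/\s*</, "<")
puts aggressive  # => <div><p>foo bar</p></div>

# Alternative: only remove whitespace directly between two tags,
# which leaves the padding inside <p> alone.
between_tags_only = sample.gsub(/>\s+</, "><")
puts between_tags_only  # => <div><p> foo bar </p></div>
```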

Upvotes: 7

domhabersack

Reputation: 431

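str.gsub(/[\n\t]/, " ").gsub(/>\s*</, "><")

The first gsub replaces every line break and tab with a space (the character class [\n\t] matches either one, whereas /\n\t/ would only match a newline immediately followed by a tab); the second removes whitespace between tags. Use gsub rather than gsub! when chaining, because gsub! returns nil if it makes no substitution.

You will end up with multiple spaces inside your tags, but if you just removed all \n and \t, you would get something like "not be removed.Butline breaks", which is not very readable. Another regular expression or the aforementioned .squeeze(" ") could take care of that.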
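
Here is how the steps could fit together, as a sketch on a shortened, made-up sample rather than the full string from the question:

```ruby
# Shortened, made-up sample with tabs and line breaks inside and outside <p>.
sample = "<div>\n\t<p>not be removed.\n\nBut\nline breaks</p>\n</div>"

# Step 1: turn every newline and tab into a space.
spaced = sample.gsub(/[\n\t]/, " ")

# Step 2: drop whitespace that sits directly between two tags.
tight = spaced.gsub(/>\s*</, "><")
puts tight     # => <div><p>not be removed.  But line breaks</p></div>

# Step 3 (optional): collapse the leftover runs of spaces inside the text.
squeezed = tight.squeeze(" ")
puts squeezed  # => <div><p>not be removed. But line breaks</p></div>
```

Keep in mind that step 3 also collapses any double spaces you put in the text on purpose, so skip it if those have to survive.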

Upvotes: 13

Justin L.

Reputation: 13600

You can condense all groups of space characters into one space (i.e., "hello     world" into "hello world") by using String#squeeze:

"hello     world".squeeze(" ")  # => "hello world"

Where the parameter of squeeze is the character to be squeezed.

EDIT: I misread your question, sorry.

On its own, squeeze would

  • remove consecutive spaces within tags
  • leave individual spaces outside tags

I'll work on a solution right now.

Upvotes: 1
