Reputation: 1628
I'm using Nokogiri to parse an HTML document. A representation of the source code which this question is based upon follows:
<td width='400' valign=top>
<b><u>Jenny ID:</u>&nbsp;8675309</b><br />
Name of Place<br />
Street Address<br />
City, State, Zip<br />
Contact: Jenny Jenny<br />
Phone: 867-5309<br />
Fax:
</td>
I'm using a couple of delimiters to retrieve the text between Jenny ID: and Name of Place. Using #strip, I'm unable to strip out the leading space.
> returned_value.inspect
=> " 8675309\r\n "
> returned_value.strip
=> " 8675309"
If I use a test string, #strip does indeed remove the leading and trailing whitespace.
> test_string = " 11111 "
> test_string.strip
=> "11111"
How can I completely strip out this leading space? I suspect it's the &nbsp; but I cannot rid myself of it.
I promise I'm not this dumb in real life, but this problem is beating me down. It's merciless.
Thank you!
Upvotes: 0
Views: 653
Reputation: 27845
I tried to reproduce your problem and created this example:
require 'nokogiri'
html = Nokogiri::HTML(<<-html
<td width='400' valign=top>
<b><u>Jenny ID:</u>&nbsp;8675309</b><br />
Name of Place<br />
Street Address<br />
City, State, Zip<br />
Contact: Jenny Jenny<br />
Phone: 867-5309<br />
Fax:
</td>
html
)
el = html.css('b').first
txt = el.content.split(':').last
puts txt # ' 8675309'
p txt #"\u00A08675309"
p txt.strip #"\u00A08675309"
The leading character is not an ordinary space but \u00A0 (the Unicode character 'NO-BREAK SPACE', U+00A0). It seems strip does not remove it.
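One way to confirm what the leading character really is would be to look at its codepoint. The following is only a minimal sketch: `txt` is hard-coded as a stand-in for the value extracted with Nokogiri above, so the snippet runs without the library, and the names `first_cp` and `stripped` are just illustrative.

```ruby
# Stand-in for the extracted text; "\u00A0" is the no-break space.
txt = "\u00A08675309"

# Codepoint of the first character: 0xA0 (no-break space),
# not 0x20 (ordinary ASCII space).
first_cp = txt.codepoints.first
puts format('U+%04X', first_cp)

# String#strip only removes ASCII whitespace and null characters,
# so the no-break space is left in place.
stripped = txt.strip
puts stripped == txt
```

If the first codepoint is 160 (0xA0), the "space" is a no-break space and has to be handled explicitly.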
If you remove the no-break space explicitly, you get the result you want. If you replace \u00A0 with ' ' (a normal space), strip can then remove it at the ends of the string, while a no-break space inside the string is only turned into a normal space rather than deleted.
Code:
p txt.gsub("\u00A0", ' ').strip #-> "8675309"
Alternatively, you can use (thanks to mu is too short):
p txt.gsub(/\p{Space}/, ' ').strip
This requires UTF-8 encoding (for the script and the string). Without it you may get an Encoding::CompatibilityError.
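Note that both gsub variants above replace every matching character, including any inside the string. If only the ends should be trimmed, a possible variation is to anchor the pattern. This is a sketch, not part of Ruby or Nokogiri: `unicode_strip` is a hypothetical helper name, and it assumes the input is a UTF-8 string.

```ruby
# Hypothetical helper: removes Unicode whitespace (including U+00A0)
# from the start and end of the string only.
def unicode_strip(str)
  # \A and \z anchor the match to the very start and end of the string;
  # [[:space:]] is Unicode-aware for UTF-8 strings in Ruby.
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

raw = "\u00A08675309\r\n "
clean = unicode_strip(raw)
p clean # "8675309"

# A no-break space inside the string is left untouched.
inner = unicode_strip("\u00A0Jenny\u00A0Jenny ")
p inner.length
```

Unlike `gsub("\u00A0", ' ').strip`, this leaves the interior of the string exactly as it was, which may matter if the no-break spaces there are intentional.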
Upvotes: 5