spex

Reputation: 1137

Google Calculator Thousands Separator Special Character

NOTE: For more answers related to this, please see Special Characters in Google Calculator

I noticed when grabbing the return value for a Google Calculator calculation, the thousands place is separated by a rather odd character. It is not simply a space.

Let's take the example of converting $4,000 USD to GBP.

If you visit the following Google link:

http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2 497.81441 British pounds",error: "",icc: true}

This looks reasonable, and the thousands place appears to be separated by a whitespace character.

However, if you enter the following into your command line:

curl -s "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2?498.28243 British pounds",error: "",icc: true}

That question mark (?) is a replacement character. What is going on?

AppleScript returns a different replacement character:

{lhs: "4000 U.S. dollars",rhs: "2†498.28243 British pounds",error: "",icc: true}

I am also getting from other sources:

{lhs: "4000 U.S. dollars",rhs: "2�498.28243 British pounds",error: "",icc: true}

It turns out that � is the Unicode replacement character, U+FFFD (decimal 65533).

Can anyone give me insight into what Google is passing me?

Upvotes: 2

Views: 910

Answers (3)

jackjr300

Reputation: 7191

According to my tests with curl in Terminal on OS X, switching the character encoding in the Terminal preferences shows that the response is encoded as ISO Latin 1 (ISO-8859-1).

With the encoding set to UTF-8, I get "2?498.28243".

With the encoding set to MacRoman, I get "2†498.28243".

First solution: send a browser user agent (Safari on OS X 10.6.8 in this example)

curl -s -A 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.48 (KHTML, like Gecko) Version/5.1 Safari/534.48' 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp'

Second solution: convert the output with iconv

curl -s 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp' | iconv -f iso-8859-1 -t utf-8
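If the conversion has to happen in a script rather than in the shell, the iconv step could be sketched in Python roughly as follows. The sample bytes are hand-built to imitate the Latin 1 response quoted in the question (nothing is fetched live), and `decode_rhs` and `amount` are hypothetical helper names.

```python
import re

# Hand-built sample imitating the Latin 1 response from the question;
# byte 0xA0 is the thousands separator. This is not a live fetch.
RAW = b'{lhs: "4000 U.S. dollars",rhs: "2\xa0498.28243 British pounds",error: "",icc: true}'


def decode_rhs(raw: bytes) -> str:
    """Decode the bytes as ISO-8859-1 (like `iconv -f iso-8859-1`) and return the rhs value."""
    text = raw.decode("iso-8859-1")
    match = re.search(r'rhs: "([^"]*)"', text)
    if match is None:
        raise ValueError("no rhs field found")
    return match.group(1)


def amount(rhs: str) -> float:
    """Drop the no-break space and parse the leading number."""
    number = rhs.split(" ", 1)[0]  # split on the ordinary space before the unit
    return float(number.replace("\u00a0", ""))


rhs = decode_rhs(RAW)
print(repr(rhs))
print(amount(rhs))
```

Decoding explicitly keeps the result independent of whatever encoding the terminal happens to be set to.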

Upvotes: 2

Joey

Reputation: 354744

It's a non-breaking space, U+00A0. It's to ensure that the number won't get broken at the end of a line.

Google returns the correct encoding (UTF-8) however:

Content-Type: text/html; charset=UTF-8

so ...

  • if it comes out as a normal space (U+0020) instead (Firefox does that when copying, stupidly enough), then the application performs conversion of certain characters to lookalikes, maybe to fit in some sort of restricted code page (ASCII perhaps).
  • if there is a question mark, then it was correctly read as Unicode but some part in processing uses a legacy character set that doesn't contain that character so it gets converted.
  • if there is a replacement character � (U+FFFD) then it was likely read as UTF-8, converted into a legacy character set that contains the character (e.g. Latin 1) and then re-interpreted as UTF-8.
  • if there is a totally different character, such as your dagger (†), then I'd guess the response is read correctly as Unicode, gets converted to a character set that contains the character and re-interpreted in another character set. A quick look at the Mac Roman codepage reveals that A0 indeed maps to †.
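The last three cases in the list above can be reproduced in a few lines of Python. This is only an illustration built on Python's standard codecs; it does not claim to show what curl, Terminal, or AppleScript do internally.

```python
NBSP = "\u00a0"  # no-break space, the separator in the response

utf8_bytes = NBSP.encode("utf-8")      # two bytes: 0xC2 0xA0
latin1_bytes = NBSP.encode("latin-1")  # one byte: 0xA0

# Question mark: the target character set (ASCII here) has no U+00A0.
as_question_mark = NBSP.encode("ascii", errors="replace").decode("ascii")

# Replacement character: a lone 0xA0 byte is not valid UTF-8.
as_replacement = latin1_bytes.decode("utf-8", errors="replace")

# Dagger: the same 0xA0 byte interpreted as Mac Roman.
as_dagger = latin1_bytes.decode("mac_roman")

for label, value in [
    ("ascii", as_question_mark),
    ("utf-8", as_replacement),
    ("mac_roman", as_dagger),
]:
    print(label, repr(value), hex(ord(value)))
```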

Needless to say, some part of whatever you use to process that response seems to be horribly broken with regard to Unicode. That is something I'd hope wouldn't happen that often in this millennium, but apparently it still does.


I figured out what it was by fiddling around in PowerShell a bit:

PS Home:\> $wc = new-object net.webclient
PS Home:\> $x = $wc.downloadstring('http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp')
PS Home:\> [char[]]$x|%{"$_ - " + +$_}
...
" - 34
2 - 50
  - 160
4 - 52
9 - 57
8 - 56
. - 46
2 - 50
8 - 56
2 - 50
4 - 52
...

Also a quick look at the response headers revealed that the encoding is set correctly.
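The same kind of inspection can be done in Python. This is a minimal sketch that uses a hard-coded stand-in for the downloaded string (an assumption, since nothing is fetched here); `describe` is a hypothetical helper name.

```python
import unicodedata

# Stand-in for the start of the downloaded rhs value;
# U+00A0 sits between "2" and "498".
sample = '"2\u00a0498.28'


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]


for ch, code, name in describe(sample):
    print(f"{ch!r} - {code} - {name}")
```

Printing the Unicode name alongside the code point makes an invisible character such as the no-break space easy to spot.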

Upvotes: 3

adayzdone

Reputation: 11238

Try

set myUrl to quoted form of "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"
set xxx to do shell script "curl " & myUrl & " | sed 's/[†]/,/'"

Upvotes: 0
