Renato Dinhani

Reputation: 36686

How to discover the Unicode codepoint and UTF-8 encoded value of an unknown character?

I'm doing text mining on content that comes from the web. There are a lot of characters that I want to convert to get better classification (e.g. &nbsp; to plain spaces).

The problem is that I sometimes get unknown characters, and I want to discover their Unicode codepoint and UTF-8 representation.

I want to know if there is an online tool or a program that can tell me this.

At the moment, I'm trying to identify a line break that I found but that doesn't match \n or \s in a regex. In the past, I had trouble with &nbsp;.

I don't know what it is, and I want to know if there is a way to find out.

The character appears here, after personagens, but it is only visible when viewing the original source without formatting.

"personagens "

Upvotes: 1

Views: 1487

Answers (2)

tchrist

Reputation: 80384

Run the uniquote program:

$ echo 'bád⁠⁠ƨtüff' | uniquote -x
b\x{E1}d\x{2060}\x{2060}\x{1A8}t\x{FC}\x{FB00}

$ echo 'bád⁠⁠ƨtüff' | uniquote -v
b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}

$ echo 'bád⁠⁠ƨtüff' | uniquote --html
bád⁠⁠ƨtüff

You don’t need to use echo; you can just cut and paste, then hit ^D when you’re done:

$ uniquote -v -
'bád⁠⁠ƨtüff'
^D
'b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}'

Upvotes: 2

Vineet Reynolds

Reputation: 76709

Based on the comments, it appears that you needed to know the Unicode codepoints of certain characters, or their UTF-8 representations.

You can use the character inspector application, written by McDowell, one of Stack Overflow's users, to determine the Unicode codepoint as well as the UTF-8 representation. Once you've pasted in the contents of the message, you'll need to set the charset to UTF-8 in the application.

You can also use the String class of the Java API to get the raw codepoints of characters in a String, via the codePointAt method. Note that if you convert the String to a char array, the array will contain UTF-16 code units. This is fine if you intend to invoke the Character.codePointAt method, but you must take care to deal with low surrogates, since a character outside the Basic Multilingual Plane occupies two chars.
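To make the Java route concrete, here is a minimal sketch of how it might look. It assumes Java 8 or later, because it relies on String.codePoints() to walk the text by codepoint instead of handling surrogates by hand. The class name CodePointInspector and its methods are hypothetical names chosen for illustration, not part of any library, and the escapes in main only stand in for whatever text you paste.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name and methods are illustrative only.
class CodePointInspector {

    // Returns the UTF-8 bytes of a single code point as hex, e.g. "C2 A0".
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to avoid sign extension when formatting the byte.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Describes one code point: U+XXXX, its Unicode name, and its UTF-8 bytes.
    static String describe(int codePoint) {
        String name = Character.getName(codePoint); // null when unassigned
        return String.format("U+%04X %s (UTF-8: %s)",
                codePoint, name == null ? "<unassigned>" : name, utf8Hex(codePoint));
    }

    // Walks the string by code point, so a surrogate pair yields one entry.
    static List<String> inspect(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp -> lines.add(describe(cp)));
        return lines;
    }

    public static void main(String[] args) {
        // Sample input written with escapes: 'b', a-acute, 'd', WORD JOINER.
        for (String line : inspect("b\u00E1d\u2060")) {
            System.out.println(line);
        }
    }
}
```

Each output line shows one codepoint, so an invisible character such as the one after "personagens" would show up as its own entry with its U+ value and UTF-8 bytes. If you read the text from a file or the network, decode it with the correct charset first, otherwise the codepoints you see will describe the mis-decoded text rather than the original.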

Upvotes: 3
