Converting Twitter archive Unicode escapes

Question

I have a Twitter archive containing about 60,000 tweets, mostly in French. In these, accented characters are represented as \u escapes with hexadecimal digits. E.g., the word "animée" is represented as "anim\u00E9e". Now I want to convert this to UTF-8. The good news is that there is a Unix utility for that, called ascii2uni.

The bad news is that apparently anything that can be interpreted as a hex digit will be interpreted as such. Therefore, instead of "animée", I end up with this nonsense: "animພ"

So how can I convert those tweets to UTF-8 in a way that doesn't mangle them like that?

Roland Illig · Accepted Answer

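The ascii2uni program's default format does not work well here: it keeps reading hex digits after the \u, so "\u00E9e" is taken as U+0E9E (ພ) instead of é followed by e. Fortunately, you can define your own custom format with -Z, here one that restricts each escape to four hex digits: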

echo 'aim\u00E9e \uD852\uDF62 bbb' | ascii2uni -Z '\u%04X'

The Chinese character (U+24B62, written here as the UTF-16 surrogate pair \uD852\uDF62) is taken from https://en.wikipedia.org/wiki/UTF-16#Examples.
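
If ascii2uni is not available, the same idea (consume exactly four hex digits per escape, then join surrogate pairs) can be sketched in Python. This is not part of the original answer, only a minimal illustration; the function name decode_u_escapes is made up for this example, and it assumes every backslash-u sequence in the input really is an escape.

```python
import re

# One \uXXXX escape: backslash, 'u', then exactly four hex digits.
# The fixed width is what keeps the 'e' in "anim\u00E9e" out of the escape.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text: str) -> str:
    """Replace \\uXXXX escapes with the characters they denote.

    Hypothetical helper for illustration; not taken from ascii2uni.
    """
    # Step 1: turn each escape into the UTF-16 code unit it names.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so that surrogate pairs such as
    # \uD852\uDF62 are merged into a single code point. An unpaired
    # surrogate makes the decode step raise UnicodeDecodeError.
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    print(decode_u_escapes(r"anim\u00E9e"))
    print(decode_u_escapes(r"aim\u00E9e \uD852\uDF62 bbb"))
```

To convert a whole archive, the function could be applied to each line read from the file and the result written back out with encoding="utf-8".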
