IHeartDuckies

Reputation: 121

Official encoding used by Twitter Streaming API? Is it UTF-8?

What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.

The only part of the Twitter site I've seen where they even hint at what they use as their official encoding is here:

Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation

https://dev.twitter.com/docs/counting-characters

Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.

Thanks! :D

Upvotes: 11

Views: 11233

Answers (3)

Yves Boutellier

Reputation: 2034

At the moment, Twitter API v2 does not send its data in UTF-8!

I believe it's UTF-16, because surrogate pairs remain when the data is decoded as UTF-8. Surrogate pairs occur only in UTF-16.

With the API I received for example this string: 🎁Crypto Heroez epic giveaway🎁

However, it didn't come this way but rather: \ud83c\udf81Crypto Heroez epic giveaway\ud83c\udf81

\ud83c\udf81 is a surrogate pair that translates into a gift emoji 🎁

In hexadecimal, UTF-16BE encodes that wrapped present as D8 3C DF 81, while UTF-8 encodes the same emoji as F0 9F 8E 81.
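The byte values above can be checked with a short sketch. It also runs the escaped string through Python's standard `json` parser, since `\ud83c\udf81` is the same form a JSON `\u` escape of this character takes. The variable names are illustrative, and the JSON text is built from the string quoted in this answer, not from a live API response.

```python
import json

GIFT = "\U0001F381"  # wrapped present emoji, U+1F381


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# JSON text holding the escaped surrogate pair, as quoted in the answer.
raw_json = '"\\ud83c\\udf81Crypto Heroez epic giveaway\\ud83c\\udf81"'
decoded = json.loads(raw_json)

utf16be_hex = hex_bytes(GIFT.encode("utf-16-be"))
utf8_hex = hex_bytes(GIFT.encode("utf-8"))

print(utf16be_hex)               # UTF-16BE code units of the emoji
print(utf8_hex)                  # UTF-8 bytes of the same emoji
print(decoded.startswith(GIFT))  # whether the parser joined the pair
```

If the last line prints `True`, the JSON parser has already combined the two escapes into a single code point, so a tokenizer that reads the raw text still has to handle the escaped form itself.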

Other developers noticed the same: https://twitterdevfeedback.uservoice.com/forums/930250-twitter-api/suggestions/41152342-utf-8-encoding-of-v2-api-responses

That issue was filed on Aug 15, 2020. As of today, 9 September 2021, Twitter has not communicated anything publicly about it, which is why I wanted to post this answer here.

Upvotes: 0

mlissner

Reputation: 18166

If they say they use UTF-8, that's a pretty good bet. UTF-8 is very common, and UTF-16 in the wild is pretty rare from what I've seen.

There are also some clever libraries you could use if you were so inclined to prove it to yourself by testing whether they support various characters. The best of these is used by Firefox to detect the encoding of webpages as they're loaded: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
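A simpler check than full charset detection is to decode the stream strictly as UTF-8 and see whether it ever fails. The sketch below is not the Mozilla detector; it is a minimal validity test using Python's standard `codecs` module. The function name `is_valid_utf8_stream` is hypothetical, and the chunks stand in for whatever byte blocks your HTTP client hands you.

```python
import codecs


def is_valid_utf8_stream(chunks):
    """Return True if the concatenated byte chunks form valid UTF-8.

    An incremental decoder is used so that a multi-byte character split
    across two chunks is not reported as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)
        # final=True makes a truncated trailing sequence raise as well.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# A 4-byte UTF-8 character split across two chunks.
split_utf8 = [b"\xf0\x9f", b"\x8e\x81"]
# The same character as UTF-16BE bytes.
utf16be = [b"\xd8\x3c\xdf\x81"]

print(is_valid_utf8_stream(split_utf8))
print(is_valid_utf8_stream(utf16be))
```

Passing this test does not prove the sender intended UTF-8, but a failure is a clear signal that the assumption is wrong.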

Upvotes: 0

deceze

Reputation: 522175

One indicator is that the JSON format, which Twitter uses for virtually everything, dictates (or at least defaults to) UTF-8. They should also set an appropriate HTTP header denoting the encoding (but I haven't confirmed this). If you're using XML instead, the XML declaration at the top of the document explicitly states the encoding, which is UTF-8.
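To check the header yourself, you can read the `charset` parameter from the response's `Content-Type` value. This is a minimal sketch using Python's standard `email.message` parser. The function name `charset_from_content_type` and the sample header values are hypothetical, not captured from Twitter, and falling back to UTF-8 when no charset is given is an assumption based on the JSON default mentioned above.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or `default` when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


# Hypothetical header values for illustration only.
with_charset = charset_from_content_type("application/json; charset=UTF-8")
without_charset = charset_from_content_type("application/json")

print(with_charset)
print(without_charset)
```

A tokenizer could call this once per connection and refuse to proceed if the result is anything other than `utf-8`.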

Upvotes: 6
