Reputation: 121
What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.
The only part of the Twitter site I've seen where they even hint at what they use as their official encoding is here:
Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation
Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.
Thanks! :D
Upvotes: 11
Views: 11233
Reputation: 2034
At the moment, the Twitter API v2 does not send its data in UTF-8!
I believe it's UTF-16, because surrogate pairs remain after decoding the data as UTF-8, and surrogate pairs only exist in UTF-16.
With the API I received for example this string: 🎁Crypto Heroez epic giveaway🎁
However, it did not arrive that way, but rather as: \ud83c\udf81Crypto Heroez epic giveaway\ud83c\udf81
\ud83c\udf81 is a surrogate pair that translates into a gift emoji 🎁
In hex, that wrapped-present emoji is encoded as D8 3C DF 81 in UTF-16BE and as F0 9F 8E 81 in UTF-8.
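The byte values quoted above can be checked with a short Python sketch. The `raw` payload below is a hand-written stand-in for the API response, not captured traffic, and the variable names are only illustrative. Note that each `\uXXXX` escape in JSON text is itself six ASCII characters, and a JSON parser combines the two escapes of a surrogate pair into one code point:

```python
import json

# Hypothetical payload, written by hand for illustration.
# Every byte in it is plain ASCII, including the \uXXXX escapes.
raw = b'"\\ud83c\\udf81Crypto Heroez epic giveaway\\ud83c\\udf81"'

# The JSON parser turns the escaped surrogate pair into one code point.
text = json.loads(raw.decode("utf-8"))
gift = text[0]  # U+1F381, the wrapped-present emoji

utf8_hex = gift.encode("utf-8").hex()        # f09f8e81
utf16be_hex = gift.encode("utf-16-be").hex()  # d83cdf81

print(text, utf8_hex, utf16be_hex)
```

Inspecting the raw bytes before JSON parsing (for example with `raw.isascii()`) is one way to tell escaped surrogates apart from actual UTF-16 on the wire.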
Other developers noticed the same: https://twitterdevfeedback.uservoice.com/forums/930250-twitter-api/suggestions/41152342-utf-8-encoding-of-v2-api-responses
That issue was opened on Aug 15, 2020. As of today, the 9th of September 2021, they haven't communicated anything publicly about it. (That's why I wanted to post this answer here.)
Upvotes: 0
Reputation: 18166
If they say they use UTF-8, that's a pretty good bet. UTF-8 is very common, and UTF-16 in the wild is pretty rare from what I've seen.
There are also some clever libraries you could use if you were so inclined to prove it to yourself by testing whether they support various characters. The best of these is used by Firefox to detect the encoding of webpages as they're loaded: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
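A much cruder test than a full charset detector is to check whether the received bytes decode as strict UTF-8. Below is a minimal sketch using Python's standard `codecs` module. The function name `is_valid_utf8_stream` is made up for this example, and the chunks stand in for whatever the streaming connection delivers:

```python
import codecs

def is_valid_utf8_stream(chunks):
    """Return True if the concatenated byte chunks are well-formed UTF-8.

    An incremental decoder is used so that a multi-byte character
    split across two chunks is still handled correctly.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)
        # final=True raises if the stream ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True

# The gift emoji in UTF-8, split across two chunks.
print(is_valid_utf8_stream([b"\xf0\x9f", b"\x8e\x81"]))
# The same emoji in UTF-16BE, which is not well-formed UTF-8.
print(is_valid_utf8_stream([b"\xd8\x3c\xdf\x81"]))
```

This is only a heuristic: passing the check does not prove the data is UTF-8 (pure-ASCII text passes too, and so can some short UTF-16 sequences), but a failure is a strong sign that it is not.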
Upvotes: 0
Reputation: 522175
One indicator is that the JSON format, which Twitter uses for virtually everything, dictates (or at least defaults to) UTF-8. They should also set an appropriate HTTP header denoting the encoding (but I haven't confirmed this). If you're using XML instead, the XML opening tag explicitly denotes the encoding, which is UTF-8.
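To act on the HTTP header point, a client could read the `charset` parameter from the `Content-Type` header and fall back to UTF-8 when none is given. The sketch below uses only the Python standard library. The helper name `charset_from_content_type` and the sample header values are assumptions for illustration, not something Twitter documents:

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header carries no charset,
    mirroring JSON's UTF-8 default.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default

print(charset_from_content_type("application/json; charset=UTF-8"))
print(charset_from_content_type("application/json"))
```

The returned name can then be passed to `bytes.decode()` before handing the text to a tokenizer.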
Upvotes: 6