Reputation: 563
While using tweepy I came across encode('utf-8'). I believe encoding to UTF-8 is used to display only tweets in English. Am I right? I ask because I want to build datasets of tweets written only in English, so I can process those tweets for NLP.
Upvotes: 1
Views: 963
Reputation: 365717
You're not right.
Unicode is a set of characters intended to cover everything needed for every language and writing system in the world [1] (plus technical stuff like math symbols).
It's not used only for English. In fact, it's the exact opposite: before Unicode, handling non-English text was hugely painful, and Unicode is the solution everyone came up with for that problem.
UTF-8 is a way of encoding Unicode characters in a binary stream. It's nothing specific to Tweepy; it's almost universal nowadays, as the default way to encode text (in any language) to disk, network, etc.
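To make that concrete, here is a minimal sketch (the sample strings and the utf8_length helper are made up for illustration) showing that UTF-8 encodes text in any script, using between one and four bytes per character:

```python
# Hypothetical sample strings in several scripts; UTF-8 can encode all of them.
samples = ["hello", "h\u00e9llo", "привет", "こんにちは", "\U0001F600"]


def utf8_length(text):
    """Return how many bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


for text in samples:
    # Number of characters versus number of UTF-8 bytes.
    print(repr(text), len(text), utf8_length(text))
```

ASCII characters take one byte each, while characters from other scripts take two to four. Nothing in the encoding filters by language.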
In Python, s.encode('utf-8') takes a Unicode string s, encodes it using UTF-8, and returns the raw bytes. You only need to call encode if you're working with binary files, network protocols, or APIs somewhere. Normally, you just open text files in text mode and read and write Unicode strings, and your print and input calls and sys.argv and so on also deal in Unicode strings, and when you get some JSON data off the network you just json.loads it and all of the strings are Unicode, and so on.
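As a rough illustration in Python 3 (the string and the JSON payload below are invented for the example, not real Tweepy output), encode is only needed at a bytes boundary, and json.loads already hands back ordinary Unicode strings:

```python
import json

s = "caf\u00e9"              # a normal Unicode string (str)
raw = s.encode("utf-8")      # bytes, for a binary file or a socket
restored = raw.decode("utf-8")  # back to str; round-trips exactly

# Hypothetical JSON payload as it might arrive off the network.
payload = b'{"text": "caf\\u00e9"}'
data = json.loads(payload.decode("utf-8"))

print(type(raw).__name__, type(data["text"]).__name__)
```

If you never touch raw bytes yourself, you typically never need to call encode at all.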
The official Python Unicode HOWTO explains a lot more of the history, background, and under-the-covers detail. If you're using Python 3.4 or 2.7 or something, you definitely need to read it. If you're using current Python, it's not as essential, but it's still a useful education.
1. There are a few groups who aren't happy with parts of Unicode, mainly to do with the fact that it forces all of the CJK languages to share the same notion of alternate characters. So, if you have an unusual Japanese surname, you might insist that Unicode doesn't really handle every language and writing system. But it's still clearly intended to do so, and definitely not intended to be English-only.
Upvotes: 1
Reputation: 2224
No, UTF-8 is a mechanism for encoding Unicode content. This means that it supports almost all scripts of the vast majority of human languages.
Upvotes: 0