Kurren
Kurren

Reputation: 81

Generate list from string with proper encoding (UTF-8)

I'm having a hard time trying to generate a list from a string, with a proper UTF-8 encoding, I'm using Python (I'm just learning to program, so bare with my silly question/terrible coding).

The source file is a tweet feed (JSON format), after parsing it successfully and extracting the tweet message from all the rest I manage to get the text with the right encoding only after a print (as a string). If I try to put it pack into list forms, it goes back to unencoded u\000000 form.

My code is:

import json

with open("file_name.txt") as tweets_file:
    tweets_list = [] 
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)

    tweet = []
    for i in tweets_list:
        key = "text"
        if key in i:
            t = i["text"]
            tweet.append(t)

    for k in tweet:
        print k.encode("utf-8")

As an alternative, I tried to have the encoding at the beginning (when fetching the file):

import json
import codecs

tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = [] 
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k

My question is: how can I put the resulting k strings, into a list? With each k string as an item?

Upvotes: 0

Views: 3357

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1123940

You are getting confused by the Python string representation.

When you print a python list (or any other standard Python container), the contents are shown in special representation to make debugging easier; each value is shown is the result of calling the repr() function on that value. For string values, that means the result is a unicode string representation, and that is not the same thing as what you see when the string is printed directly.

Unicode and byte strings, when shown like that, are presented as string literals; quoted values that you can copy and paste straight back into Python code without having to worry about encoding; anything that is not a printable ASCII character is shown in quoted form. Unicode code points beyond the latin-1 plane are shown as '\u....' escape sequences. Characters in the latin-1 range use the '\x.. escape sequence. Many control characters are shown in their 1-letter escape form, such as \n and \t.

The python interactive prompt does the same thing; when you echo a value on the prompt without using print, the value in 'represented', shown in the repr() form:

>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string

This entirly normal behaviour. In other words, your code works, nothing is broken.

To come back to your code, if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file, don't bother with looping twice:

import json

with open("file_name.txt") as tweets_file:
    tweets = [] 
    for line in tweets_file:
        data = json.loads(a)
        if 'text' in data:
            tweets.append(data['text'])

Upvotes: 3

Related Questions