Parsing unicode in tweets using python

Question

I've tried all manner of approaches to reading in this sample of tweets from a file, but the Unicode character Victory Hand (U+270C) will not parse. Here's the data sample.

399491624029274112,Kyle aka K-LO,I unlocked 2 Xbox Live achievements in WWE 2K14! http://t.co/wRIxZTjYWg,False,0,Raptr,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,0
399491626584014848,Dots Group LLC,GeekWire Radio: Amazon vs. author  Xbox One first take  and favorite iPad apps - GeekWire http://t.co/jbbryoHpHe,False,0,IFTTT,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,2
399491630149169152,BETTINGGENIUS!,RT @xJohn69: Sergio Ramos giveaway!; XBOX + PS3; ; -RT; -Follow me and @NeillWagers; -S/Os appreciated; ; Goodluck http://t.co/D997faGSB5,False,0,Twitter for iPad,,,,,2013,11,10,11,0,1,1,0,1,0,0,0,0,2
399491635735953408,Princess of TV,Toy Story of Terror is amaze balls. Thanks Xbox for the free NowTV #disneyweekend,False,0,Twitter for iPhone,,,,,2013,11,10,11,0,2,0,0,1,0,0,0,0,2
399491654136369152,Sam Hambre,'9 Things You Should Know Before Buying a PlayStation 4'  http://t.co/Q3Ma1R83cF,False,0,Buffer,,,,,2013,11,10,11,0,7,0,1,0,0,0,0,0,0
399491655780167680,Rhi ✌,@Escape2theMoon that's done what? im not on rn obvs i dont even have access to an xbox :c ?,False,0,web,399490703761223680,Escape2theMoon,1404625770,,2013,11,10,11,0,7,0,0,1,0,0,0,0,0

You can see the victory hand in the second field of the last tweet.

What I want to do is build one long string from all of the tweets. Starting very simply, I cannot even get this script to run:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import csv

current_file = codecs.open("C:/myfile.csv", encoding="utf-8")
data = csv.reader(current_file, delimiter=",")

tweets = ""

for record in data:
    tweets = tweets + " " + record[2].encode('utf-8', errors='replace')

I've tried many permutations of importing, encoding, concatenating, converting to unicode, etc... but I cannot get past the victory hand. The error I always receive is:

UnicodeEncodeError                        Traceback (most recent call last)
 in ()
----> 1 for record in data:
      2     tweets = tweets + ' ' + record[2].encode('utf-8', 'replace')

UnicodeEncodeError: 'ascii' codec can't encode character u'\u270c' in position 23: ordinal not in range(128)

What am I doing wrong? How can I concatenate all of these tweets into a single string without unicode problems?

alko · Accepted Answer

The problem is with csv.reader, which in Python 2 tries to convert the unicode input back to ASCII. Note this passage from the csv docs:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
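To see the failure in isolation, here is a minimal sketch (not part of the original post) that encodes the victory hand character directly. It behaves the same on Python 2 and Python 3; the variable names are purely illustrative. The ASCII attempt is essentially the conversion that csv.reader performs implicitly when it is handed unicode lines on Python 2.

```python
# -*- coding: utf-8 -*-
# The victory hand character from the sample data (U+270C).
victory_hand = u"\u270c"

# Encoding to ASCII fails: the character is outside range(128).
try:
    victory_hand.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and produces a three-byte sequence.
utf8_bytes = victory_hand.encode("utf-8")
```

This is why the recipe encodes each line to UTF-8 before the csv module sees it, then decodes each cell afterwards.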

As proposed, you can use this recipe from the docs examples:

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

With the unicode_csv_reader helper, your code might look as follows (slightly modified to use a with block and a join instead of a loop):

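import codecs
from operator import itemgetter

tweets_fname = "C:/myfile.csv"

with codecs.open(tweets_fname, encoding="utf-8") as current_file:
    data = unicode_csv_reader(current_file, delimiter=",")
    # Take the third field (the tweet text) of every row and join with spaces.
    tweets = u' '.join(map(itemgetter(2), data))
    # Encode only at the end, when a byte string is actually needed.
    encoded_tweets = tweets.encode('utf8', 'replace')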
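
For readers on Python 3, the csv module works with Unicode text directly, so the helper above should not be needed. The following is a minimal sketch under that assumption, not the method from the answer: the function name join_tweets is hypothetical, and the two-row sample is written to a temporary file only for illustration.

```python
import csv
import os
import tempfile


def join_tweets(path, column=2):
    """Join one column of a UTF-8 CSV file into a single space-separated string."""
    # The csv docs recommend newline="" when opening files for the csv module.
    with open(path, encoding="utf-8", newline="") as f:
        return " ".join(row[column] for row in csv.reader(f))


# Illustrative data only: two short rows shaped like the sample above.
sample = "1,Rhi \u270c,first tweet,False\n2,Sam,second tweet,False\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".csv",
                                 delete=False, newline="") as tmp:
    tmp.write(sample)

tweets = join_tweets(tmp.name)            # the tweet text column
names = join_tweets(tmp.name, column=1)   # the column containing the victory hand
os.remove(tmp.name)
```

Here the file is decoded once when it is opened, and every cell comes back as a normal Python 3 str, so no per-cell encode or decode step is required.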
