Reputation: 5

How do I decode strings in 'utf-8'?

I'm using tweepy to capture some tweets in Portuguese and I'm saving these tweets in a csv file. All the tweet text was saved with escaped special characters and now I can't convert it to the correct format.

My code for capturing the tweets is:

csvFile = open('ua.csv', 'a')
csvWriter = csv.writer(csvFile)
for tweet in tweepy.Cursor(api.user_timeline,id=usuario,count=10,
                           lang="en",
                           since="2018-12-01").items():
    csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])

I'm reading the results like this:

test = pd.read_csv('ua.csv', header=None)
test.columns = ["date", "text"]
result = test['text'][0]
print(result)
'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

The result I need should be this:

print(result)
'Aproveita essa promoção aqui!'

I tried this code to convert:

print(result.decode('utf-8'))

and got this error message:

AttributeError: 'str' object has no attribute 'decode'

What am I doing wrong?

Upvotes: 0

Answers (3)

Mark Tolonen

Reputation: 178429

Open the file with the encoding to be used. Don't encode it manually (Zen of Python: Explicit is better than implicit):

# newline='' per csv documentation
# encoding='utf-8-sig' if you plan on using Excel to read the csv, else 'utf8' is fine.
with open('ua.csv','a',encoding='utf-8-sig',newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    for tweet in tweepy.Cursor(api.user_timeline,id=usuario,count=10,
                               lang="en",
                               since="2018-12-01").items():
        csvWriter.writerow([tweet.created_at, tweet.text])

Here's a working example:

import csv
import pandas as pd

with open('ua.csv','w',encoding='utf-8-sig',newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['timestamp','Aproveita essa promoção aqui!'])

test = pd.read_csv('ua.csv', encoding='utf-8-sig', header=None)
print(test)

Output:

           0                              1
0  timestamp  Aproveita essa promoção aqui!
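
To see what 'utf-8-sig' actually changes, here is a minimal sketch using only the standard library. The scratch file name bom_demo.csv and the variable names are assumptions made for illustration; they are not part of the original code.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'bom_demo.csv')

# Writing with 'utf-8-sig' puts a byte order mark (BOM) at the start of the file.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(['timestamp', 'Aproveita essa promoção aqui!'])

# Check the raw bytes for the UTF-8 BOM.
with open(path, 'rb') as f:
    has_bom = f.read().startswith(codecs.BOM_UTF8)

# Reading with 'utf-8-sig' strips the BOM again.
with open(path, encoding='utf-8-sig', newline='') as f:
    row_sig = next(csv.reader(f))

# Reading with plain 'utf-8' leaves the BOM attached to the first field.
with open(path, encoding='utf-8', newline='') as f:
    row_plain = next(csv.reader(f))

print(has_bom)
print(row_sig)
print(repr(row_plain[0]))
```

This is why the read side should use the same 'utf-8-sig' encoding as the write side: with plain 'utf-8', the first field of the file comes back with a leading '\ufeff' character.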

Upvotes: 0

juanpa.arrivillaga

Reputation: 96360

The problem is that you are creating a bytes object when you .encode your tweet. You don't need to do this.

A csv.writer object will coerce to string whatever you pass to it.

Note:

In [1]: import csv

In [2]: s = 'Aproveita essa promoção aqui!'

In [3]: print(s)
Aproveita essa promoção aqui!

In [4]: print(s.encode())
b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

In [5]: with open('test.txt', 'a') as f:
   ...:     writer = csv.writer(f)
   ...:     writer.writerow([1, 3.4, 'Aproveita essa promoção aqui!'.encode()])
   ...:

In [6]: !cat test.txt
1,3.4,b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

So just use:

csvWriter.writerow([tweet.created_at, tweet.text])
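
For rows that were already written with the b'...' wrapper, one possible way to recover the text is to parse the cell back into a bytes object and then decode it. This is a rough sketch, assuming every damaged cell looks like the test.txt output above; recover_text is a hypothetical helper name, not something from csv or pandas.

```python
import ast


def recover_text(cell):
    """Convert a cell such as "b'promo\\xc3\\xa7\\xc3\\xa3o'" back to readable text.

    Cells that do not look like a bytes literal are returned unchanged.
    """
    if cell.startswith(("b'", 'b"')):
        try:
            # literal_eval only parses Python literals; it does not run code.
            value = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell
        if isinstance(value, bytes):
            return value.decode('utf-8')
    return cell


# The cell exactly as it was stored in the csv file (a str, not bytes).
cell = r"b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'"
print(recover_text(cell))
```

With the DataFrame from the question, something like test['text'].map(recover_text) should apply the same repair to the whole column, assuming the column holds strings of this shape.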

Upvotes: 1

Scott Hunter

Reputation: 49920

The pandas read_csv has an encoding parameter:

Encoding to use for UTF when reading/writing (ex. ‘utf-8’).

Upvotes: 0
