Reputation: 21
I am trying to extract tweets with Python and store them in a CSV file, but text in some languages does not come out correctly: Arabic appears as special characters.
def recup_all_tweets(screen_name,api):
    all_tweets = []
    new_tweets = api.user_timeline(screen_name,count=300)
    all_tweets.extend(new_tweets)
    #outtweets = [[tweet.id_str, tweet.created_at, tweet.text,tweet.retweet_count,get_hashtagslist(tweet.text)] for tweet in all_tweets]
    outtweets = [[tweet.text,tweet.entities['hashtags']] for tweet in all_tweets]
    # with open('recup_all_tweets.json', 'w', encoding='utf-8') as f:
    #     f.write(json.dumps(outtweets, indent=4, sort_keys=True))
    with open('recup_all_tweets.csv', 'w',encoding='utf-8') as f:
        writer = csv.writer(f,delimiter=',')
        writer.writerow(["text","tag"])
        writer.writerows(outtweets)
    # pass
    return(outtweets)
Upvotes: 2
Views: 1712
Reputation: 568
I searched a lot and finally wrote the following piece of code:
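import arabic_reshaper
import pandas as pd
from time import sleep
from bidi.algorithm import get_display
from selenium.webdriver.common.by import By

# `webdriver` is assumed to be an already-created Selenium driver instance.
itemsX = webdriver.find_elements(By.CLASS_NAME,"x1i10hfl")
item_linksX = [itemX.get_attribute("href") for itemX in itemsX]
# Keep only post links; skip elements that have no href.
item_linksX = filter(lambda k: k and '/p/' in k, item_linksX)
counter = 0
for item_linkX in item_linksX:
    AllComments2 = []
    counter = counter + 1
    webdriver.get(item_linkX)
    print(item_linkX)
    sleep(11)  # wait for the page to load
    comments = webdriver.find_elements(By.CLASS_NAME,"_aacl")
    for comment in comments:
        try:
            reshaped_text = arabic_reshaper.reshape(comment.text)
            # Display-order form; computed here but the reshaped text is what gets stored.
            bidi_text = get_display(reshaped_text)
            AllComments2.append(reshaped_text)
        except Exception:
            pass
    # One tab-separated UTF-16 file per post; raw string keeps the backslashes literal.
    df = pd.DataFrame({'col':AllComments2})
    df.to_csv(r'C:\Crawler\Comments' + str(counter) + '.csv', sep='\t', encoding='utf-16')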
This code worked perfectly for me. I hope it helps those who haven't used the code from the previous post.
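For anyone without pandas or Selenium at hand, here is a minimal sketch of just the output step above: a tab-separated file written as UTF-16 and read back. It uses the standard csv module in place of DataFrame.to_csv, and the comment strings, the temporary directory and the file name are made up for illustration.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical sample comments standing in for the scraped text.
comments = ["تعليق أول", "تعليق ثان"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Comments1.csv")

    # Write one column, tab-separated, encoded as UTF-16.
    with open(path, "w", encoding="utf-16", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["col"])
        writer.writerows([c] for c in comments)

    # Look at the raw bytes: the "utf-16" codec starts the file with a BOM.
    with open(path, "rb") as f:
        raw = f.read()

    # Read the file back with the same encoding and delimiter.
    with open(path, encoding="utf-16", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))

starts_with_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)
print(rows)
```

The point of the sketch is only that the text survives the round trip as long as the reader uses the same encoding the writer used.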
Upvotes: 0
Reputation: 306
Maybe try passing ensure_ascii=False to json.dumps:
f.write(json.dumps(outtweets, indent=4, sort_keys=True, ensure_ascii=False))
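A small sketch of what that flag changes, assuming a made-up row shaped like the question's outtweets (text plus a list of tags). By default json.dumps escapes every non-ASCII character; with ensure_ascii=False the characters are written as-is, so the file must be opened with a Unicode encoding such as utf-8.

```python
import json

# Hypothetical data in the same shape as `outtweets` from the question.
rows = [["عربى", ["tag"]]]

escaped = json.dumps(rows)                       # default: \uXXXX escapes
readable = json.dumps(rows, ensure_ascii=False)  # keeps the Arabic letters

print(escaped)
print(readable)

# Both strings decode back to the same data; only the on-disk form differs.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

Either form is valid JSON; the second is simply the one a person can read in a text editor.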
Upvotes: 0
Reputation: 177891
Example of writing both CSV and JSON:
#coding:utf8
import csv
import json

s = ['عربى','عربى','عربى']

with open('output.csv','w',encoding='utf-8-sig',newline='') as f:
    r = csv.writer(f)
    r.writerow(['header1','header2','header3'])
    r.writerow(s)

with open('output.json','w',encoding='utf8') as f:
    json.dump(s,f,ensure_ascii=False)
output.csv:
header1,header2,header3
عربى,عربى,عربى
[Screenshot: output.csv viewed in Excel]
output.json:
["عربى", "عربى", "عربى"]
Note: Microsoft Excel needs utf-8-sig to read a UTF-8 file properly. Other applications may or may not need it to display the file correctly. Many Windows applications require a UTF-8 "BOM" signature at the start of a text file; without it they assume an ANSI encoding instead. Which ANSI encoding that is depends on the localized version of Windows in use.
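To make the BOM point concrete, here is a rough sketch that writes the same row with utf-8 and with utf-8-sig and compares the raw bytes. The helper name write_csv and the temporary file names are made up for the example, and cp1252 is used only as one example of an ANSI code page; the one a given Windows installation picks may differ.

```python
import codecs
import csv
import os
import tempfile

row = ["عربى", "عربى", "عربى"]

def write_csv(path, encoding):
    """Write one CSV row with the given encoding and return the raw bytes."""
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerow(row)
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    plain = write_csv(os.path.join(tmp, "plain.csv"), "utf-8")
    signed = write_csv(os.path.join(tmp, "signed.csv"), "utf-8-sig")

# utf-8-sig writes the same bytes as utf-8, preceded by the 3-byte signature.
print(plain.startswith(codecs.BOM_UTF8))
print(signed.startswith(codecs.BOM_UTF8))

# What an application sees if it guesses an ANSI code page for the
# unsigned file (cp1252 chosen here purely as an example).
misread = plain.decode("cp1252", errors="replace")
print(misread)
```

The misread string is the kind of "special characters" described in the question: the bytes are fine, they are just being decoded with the wrong encoding.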
Upvotes: 1