Reputation: 245
I've written a BeautifulSoup script that scrapes Japanese HTML. Everything seems to work and I get no error messages. When I print the result I get:
連鎖に打ち勝たねばならない」と述べ拍手を浴び etc
But in the same script, when I save the output to a CSV file, I get:
\u5ddd\u3001\u6ce2\u4f50\u5834\uff13\u7279\u6d3e\u54e1\u304c\u8a71\u3057\u5408 etc
I assume the problem is in the write-to-csv part of the code, but I can't figure out what to do.
Here's the code:
import csv
import os

from bs4 import BeautifulSoup

def processData( pageFile ):
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page, 'html.parser')
    metaData = soup.find_all("div", {'class': 'detail001'})
    one = [ ]
    for html in metaData:
        text = BeautifulSoup(str(html).strip().replace("\n", ""),features="html.parser")
        text = text.get_text()
        one.append(text.strip())
    csvfile = open(dir2 + ".csv".encode("utf-8"), 'ab')
    writer = csv.writer(csvfile)
    for ones in zip(one):
        writer.writerow([one])
    csvfile.close()

dir1 = "/home/sveisa/"
dir2 = "test2"
dir = dir1 + dir2
csvFile = dir2 + ".csv"
csvfile = open(csvFile.encode("utf-8"), 'w')
writer = csv.writer(csvfile)
writer.writerow(["one"])
csvfile.close()
fileList = os.listdir(dir)
totalLen = len(fileList)
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile)
    processData(path)
I'm using Ubuntu.
Upvotes: 2
Views: 426
Reputation: 11515
The issue is the encoding. In Python 3, pass encoding= to open() when you open the CSV file, like this:
import csv

with open("data.csv", 'w', encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)
    # writerow expects a sequence of fields, so wrap the string in a list
    writer.writerow(
        ["\u5ddd\u3001\u6ce2\u4f50\u5834\uff13\u7279\u6d3e\u54e1\u304c\u8a71\u3057\u5408"])
Output Content:
川、波佐場３特派員が話し合
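One more thing that may be worth checking, offered as a sketch rather than a definitive fix: in the question's loop, writer.writerow([one]) writes the whole list one as a single cell on every iteration instead of the loop variable. On Python 2, which the 'ab' mode and the .encode() calls suggest, the string form of a list of unicode strings shows \uXXXX escapes, and that would match the CSV output in the question. The Python 3 sketch below writes one row per scraped string. The helper write_rows, the sample strings, and the temporary path are assumptions for illustration and are not part of the original script.

```python
import csv
import os
import tempfile


def write_rows(csv_path, texts):
    """Append one CSV row per string (hypothetical helper, Python 3)."""
    # Text mode with an explicit encoding; newline="" lets csv handle line endings.
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for text in texts:
            # One field per row: the string itself, not the whole list.
            writer.writerow([text])


# Sample strings standing in for the scraped text (assumed for illustration).
scraped = ["川、波佐場３特派員が話し合", "連鎖に打ち勝たねばならない"]

# Temporary location so the sketch does not touch real files.
csv_path = os.path.join(tempfile.mkdtemp(), "test2.csv")
write_rows(csv_path, scraped)

# Read the file back to check that the characters survived the round trip.
with open(csv_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```

Opening the file with newline="" is what the csv module documentation recommends for files passed to csv.writer and csv.reader.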
Upvotes: 3