Reputation: 161
I'm trying to scrape a page in Japanese using Python, curl, and BeautifulSoup. I then save the text to a MySQL database that uses UTF-8 encoding, and display the resulting data using Django.
Here is an example URL:
I have a function I use to extract the HTML as a string:
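    # Imports assumed from the calls below (Python 2, pycurl)
    from StringIO import StringIO
    from pycurl import Curl

    def get_html(url):
        c = Curl()
        storage = StringIO()  # collects the raw response bytes
        c.setopt(c.URL, str(url))
        cookie_file = 'cookie.txt'
        c.setopt(c.COOKIEFILE, cookie_file)
        c.setopt(c.COOKIEJAR, cookie_file)
        c.setopt(c.WRITEFUNCTION, storage.write)
        c.perform()
        c.close()
        # Returns an undecoded byte string, not a unicode string
        return storage.getvalue()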
I then pass it to BeautifulSoup:
    html = get_html(str(scheduled_import.url))
    soup = BeautifulSoup(html)
The result is then parsed and saved to a database. I then use Django to output the data as JSON. Here is the view I'm using:
    def get_jobs(request):
        jobs = Job.objects.all().only(*fields)
        joblist = []
        for job in jobs:
            job_dict = {}
            for field in fields:
                job_dict[field] = getattr(job, field)
            joblist.append(job_dict)
        return HttpResponse(dumps(joblist), mimetype='application/javascript')
The resulting page displays escaped byte sequences such as:
\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88
\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9
\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\
instead of Japanese.
I've been researching all day: I converted my DB to UTF-8, and I tried decoding the text from iso-8859-1 and re-encoding it as UTF-8.
Basically I have no idea what I'm doing and would appreciate any help or suggestions I can get so I can avoid spending another day trying to figure this out.
Upvotes: 15
Views: 2986
Reputation: 13496
The examples you posted are the escaped representation of a UTF-8 encoded byte string: each \xNN is a single byte, and each Japanese character in your samples takes three of them. You need to decode that byte string into a Python unicode string. If you are not sure which codec is the right one, experiment with it in the Python console.
Try my_new_string = my_string.decode('utf-8') to get the Python unicode string. This should display correctly in Django templates, can be saved to the DB, and so on. As a quick check, you can also run print my_new_string and you should see it output Japanese characters.
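To make the decoding step concrete, here is a minimal sketch using the second sample line from the question. It is written for Python 3, where byte strings carry a b prefix, whereas the code in the question is Python 2; the decode call itself is the same. The helper name to_text and the "title" key are made up for illustration, and this only shows what decoding does, not where in the scraping pipeline the bytes should be decoded.

```python
import json

# Raw UTF-8 bytes copied from the second sample line in the question.
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


text = to_text(raw)

# 12 bytes decode to 4 characters (three bytes per character here).
print(len(raw), len(text))

# json.dumps escapes non-ASCII characters as \uXXXX by default;
# ensure_ascii=False keeps the characters themselves in the output.
escaped = json.dumps({"title": text})
readable = json.dumps({"title": text}, ensure_ascii=False)
print(escaped)
```

Both JSON forms parse back to the same text in a browser, so either is fine for the view; what matters is that the value handed to dumps is decoded text rather than an escaped byte string.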
Upvotes: 0