Reputation: 421
I'm working on a pretty data-intensive algorithm here, and speed is my top priority. Essentially it involves working with very large strings. Without getting into too much detail, it runs in the blink of an eye without these lines of code:
html = unicode(strip_tags(html_source), errors='ignore')
html2 = unicode(strip_tags(html_source2), errors='ignore')
The problem if I don't decode each string to unicode is that I get the dreaded:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5747: ordinal not in range(128)
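For reference, the failure can be reproduced in isolation with an explicit `bytes.decode` call, which is the same implicit ASCII decode that `unicode()` performs under the hood (the sample byte string here is hypothetical):

```python
# Hypothetical byte string containing a non-ASCII UTF-8 sequence (0xc3 0xa9).
raw = b'caf\xc3\xa9'

try:
    raw.decode('ascii')  # same implicit decode that unicode(raw) attempts
except UnicodeDecodeError as exc:
    # 'ascii' codec can't decode byte 0xc3 in position 3: ...
    print(exc)
```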
Is there anything I could do to streamline this process? The little bits of data that aren't in the ASCII range are not too important to me. Is there any way I could just ignore the errors altogether without decoding the whole string?
Thank you very much! (I am currently using Python 2.7.3)
Upvotes: 0
Views: 464
Reputation: 298374
You can strip out all non-ASCII characters with `.decode()`:
your_string.decode('ascii', errors='ignore')
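A minimal sketch of this approach (the sample string is hypothetical; the `errors` argument is passed positionally here, since Python 2's `str.decode` is stricter about keyword arguments than Python 3's `bytes.decode`). Every byte outside the 0-127 range is silently dropped:

```python
# UTF-8 bytes containing non-ASCII sequences (an accented e and a euro sign).
raw = b'caf\xc3\xa9 cost: 5\xe2\x82\xac'

# 'ignore' drops the undecodable bytes instead of raising UnicodeDecodeError.
cleaned = raw.decode('ascii', 'ignore')
print(cleaned)  # caf cost: 5
```

Note that the dropped bytes are gone for good, which matches the stated requirement that data outside the ASCII range is unimportant.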
Upvotes: 2