Reputation: 21
I am reading a UTF-8 file using Python's normal text decoding, and I need to get rid of all the quotes in it. However, the text contains several types of quote characters, and I can't figure out how to get rid of all of them. The code below is an example of what I've been trying to do.
def change_things(string, remove):
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string
where
remove = {
    '\'': '',
    '\"': '',
}
Unfortunately, this code only removes normal quotes, not left or right facing quotes. Is there any way to remove all such quotes using a similar format to what I have done (I recognize that there are other, more efficient ways of removing items from strings but given the overall context of the code this makes more sense for my specific project)?
Upvotes: 0
Views: 843
Reputation: 14906
You can just type those sorts of quotes into your file and replace them the same as any other character.
utf8_quotes = "“”‘’‹›«»"
mystr = 'Text with “quotes”'
mystr = mystr.replace('“', '"').replace('”', '"')  # str.replace returns a new string, so keep the result
There are a few different single-quote variants too.
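As a rough sketch of how that fits the dictionary format from the question, the curly quotes can be added as extra keys. The escapes below are just another way of writing the same characters; the sample string is made up for illustration, and the set of quotes listed is an assumption about what the file contains.

```python
def change_things(string, remove):
    # Replace every key of `remove` found in `string` with its mapped value.
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string

remove = {
    "'": '',        # straight single quote
    '"': '',        # straight double quote
    '\u2018': '',   # ‘ left single quotation mark
    '\u2019': '',   # ’ right single quotation mark (also used as an apostrophe)
    '\u201c': '',   # “ left double quotation mark
    '\u201d': '',   # ” right double quotation mark
}

# Hypothetical sample text: “It’s” a "test"
sample = '\u201cIt\u2019s\u201d a "test"'
cleaned = change_things(sample, remove)
print(cleaned)
```

Note that removing ’ also strips apostrophes written with that character, which may or may not be what you want.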
Upvotes: 1
Reputation: 16573
There are multiple ways to do this. A regex is one:
import re
newstr = re.sub(u'[\u201c\u201d\u2018\u2019]', '', oldstr)
Another clean way to do it is to use the Unidecode package. This doesn't remove the quotes directly, but converts them to neutral quotes. It also converts any non-ASCII character to its closest ASCII equivalent:
from unidecode import unidecode
newstr = unidecode(oldstr)
Then, you can remove the quotes with your code.
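If converting every non-ASCII character is more than you want, a standard-library alternative is str.translate, which can delete only the characters you list and leave accented letters alone. This is a minimal sketch: the QUOTES selection and the strip_quotes name are assumptions for illustration, not part of the question's code.

```python
# Characters to delete; this selection is an assumption, extend it as needed.
QUOTES = '\'"\u2018\u2019\u201c\u201d\u2039\u203a\u00ab\u00bb'

# With three arguments, str.maketrans maps every character in the third
# argument to None, which makes str.translate delete it.
QUOTE_TABLE = str.maketrans('', '', QUOTES)

def strip_quotes(text):
    # Remove the listed quote characters in a single pass.
    return text.translate(QUOTE_TABLE)

# Hypothetical sample text: «café» “naïve”
result = strip_quotes('\u00abcaf\u00e9\u00bb \u201cna\u00efve\u201d')
print(result)
```

Unlike unidecode, this leaves é and ï untouched and removes only the quote characters.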
Upvotes: 0
Reputation: 186
There's a list of Unicode quote marks at https://gist.github.com/goodmami/98b0a6e2237ced0025dd. That should allow you to remove any type of quotes.
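One way to turn such a list into the dictionary format from the question is to build the mapping from code points. The list below is a partial, illustrative selection of common quotation marks that I'm sure of, not a copy of the gist, so treat it as a starting point.

```python
# Partial, illustrative list of quotation-mark code points (an assumption;
# extend it from whatever reference list you trust).
QUOTE_CODE_POINTS = [
    0x0022,  # " quotation mark
    0x0027,  # ' apostrophe
    0x00AB,  # « left-pointing double angle quotation mark
    0x00BB,  # » right-pointing double angle quotation mark
    0x2018,  # ‘ left single quotation mark
    0x2019,  # ’ right single quotation mark
    0x201A,  # ‚ single low-9 quotation mark
    0x201B,  # ‛ single high-reversed-9 quotation mark
    0x201C,  # “ left double quotation mark
    0x201D,  # ” right double quotation mark
    0x201E,  # „ double low-9 quotation mark
    0x201F,  # ‟ double high-reversed-9 quotation mark
    0x2039,  # ‹ single left-pointing angle quotation mark
    0x203A,  # › single right-pointing angle quotation mark
]

# Same shape as the question's `remove` dict: character -> replacement.
remove = {chr(cp): '' for cp in QUOTE_CODE_POINTS}

def change_things(string, remove):
    # The question's function: apply each replacement in turn.
    for thing in remove:
        string = string.replace(thing, remove[thing])
    return string

# Hypothetical sample text: „Hallo“, ‹hi›
text = '\u201eHallo\u201c, \u2039hi\u203a'
cleaned = change_things(text, remove)
print(cleaned)
```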
Upvotes: 0