Big_VAA

Reputation: 794

Remove weird characters using Python

I have a large SQL file with about 1 million INSERT statements in it. About 6,000 of them are corrupted with weird characters that I need to remove so I can insert them into my DB.

Ex: INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');

I want to remove only the weird characters and leave all of the normal ones.

I tried using the following code to do so:

import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")

for line in fileOld: #in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)

But it doesn't seem to be working: when I print s, it is the same as what was printed for line, and, stranger still, nothing gets written to the file.

Any advice or tips on how to solve this would be useful.

Upvotes: 0

Views: 2426

Answers (2)

GLHF

Reputation: 4044

import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
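newstrg = ""
# Whitelist of punctuation to keep in addition to ASCII letters and digits.
acc = """ '",{}[].`;:  """
for x in strg:
    # Keep the character only if it is an ASCII letter, a digit, or whitelisted punctuation.
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print(newstrg)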

Output:

'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
>>>

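You can check each character of the string against a whitelist (ASCII letters, digits, and a few punctuation characters) and build a new string from the characters that pass. Note that the whitelist in acc does not include / or ), so those are removed too, as the output shows; add them to acc if you want to keep them.

It also depends on your variable type. Strings are immutable, so you have to build a new one. If you work with a list instead, you don't have to define a new variable: del mylist[x] removes the element at index x in place.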
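
To apply the same idea to the whole file, something along these lines might work. It is only a sketch: the names strip_non_printable and clean_file are made up for illustration, and reading the dump as UTF-8 with errors="replace" is an assumption about the file. It also covers the two things that appear to go wrong in the question's code: filter() returns a new object rather than changing s, so its result has to be kept, and the output file should be closed (here via with) so buffered data is flushed to disk.

```python
import string

# Characters to keep: everything in string.printable
# (ASCII letters, digits, punctuation and whitespace).
PRINTABLE = set(string.printable)


def strip_non_printable(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in PRINTABLE)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, dropping non-printable characters."""
    # Assumption: the dump is (mostly) UTF-8. errors="replace" turns undecodable
    # bytes into U+FFFD, which is then filtered out like any other odd character.
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            dst.write(strip_non_printable(line))


print(strip_non_printable("Petite histoire de la d\ufffd\u00a9sinformation"))
# prints: Petite histoire de la dsinformation
```

Unlike the hand-written whitelist, string.printable keeps / and ), so the URLs and the closing parenthesis of each INSERT survive. Usage would be clean_file('text1.txt', 'newfile.txt') with the file names from the question.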

Upvotes: 2

David Lai

Reputation: 822

You can use the regular expression function re.sub() to do simple string replacements: https://docs.python.org/2/library/re.html#re.sub

# -*- coding: utf-8 -*-

import re

dirty_string = u'©sinformation'
# The first argument is the regex to match; here it is a negated character
# class, so everything except the listed characters is removed.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)

print(clean_string)
# Output: sinformation

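The character class [^a-zA-Z0-9./] also strips spaces, quotes, commas and parentheses, which a full INSERT statement needs. A possible variation, shown here only as a sketch, is to remove everything outside the printable ASCII range instead. The names NON_ASCII_PRINTABLE and clean_line and the shortened sample line are made up for illustration.

```python
import re

# Matches any character that is NOT printable ASCII (space through tilde),
# a tab, or a newline. Everything it matches gets removed.
NON_ASCII_PRINTABLE = re.compile(r"[^\x20-\x7E\t\n]")


def clean_line(line):
    """Return line with all characters outside printable ASCII removed."""
    return NON_ASCII_PRINTABLE.sub("", line)


# Shortened, hypothetical version of the corrupted INSERT from the question.
sample = "VALUES ('2268032019','Petite histoire de la d\ufffd\u00a9sinformation',1999);"
print(clean_line(sample))
# prints: VALUES ('2268032019','Petite histoire de la dsinformation',1999);
```

Compiling the pattern once is just a convenience when the same substitution runs over about a million lines.
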
Upvotes: -1
