Reputation: 969
I am working on a text classification problem. My CSV file contains a column called 'description' which describes events. Unfortunately, that column contains many special (non-English) characters alongside English words. Sometimes an entire field consists of such characters; sometimes only a few words do and the rest are English. Here are two sample fields from two different rows:
हर वर्ष की तरह इस वर्ष भी सिंधु सेना द्वारा आयोजित सिंधी प्रीमियर लीग फुटबॉल टूर्नामेंट का आयोजन एमबीएम ग्राउंड में करने जा रही है जिसमें अंडर-19 टीमें भाग लेती है आप सभी से निवेदन है समाज के युवाओं को प्रोत्साहन करने अवश्य पधारें
Unwind on the strums of Guitar & immerse your soul into the magical vibes of music! ️? ️?..Guitar Night By Ashmik Patil.July 19, 2018.Thursday.9 PM Onwards.*Cover charges applicable...#GuitarNight #MusicalNight #MagicalMusic #MusicLove #Party #Enjoy #TheBarTerminal #Mumbaikars #Mumbai
In the first one, the entire field consists of such unreadable characters, whereas in the second only a few such characters are present and the rest are English words.
I want to remove only those special characters and keep the English words as they are, since I need those English words to build a bag of words at a later stage.
How can I implement that in Python (I am using a Jupyter notebook)?
Upvotes: 1
Views: 2862
Reputation: 917
You can encode your string to ASCII and ignore the errors.
>>> text = 'Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖'
>>> text = text.encode('ascii', 'ignore')
This gives you a bytes object, which you can then decode back to a string:
>>> text
b'Something with special characters '
>>> text = text.decode('utf')
>>> text
'Something with special characters '
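The two steps above can be folded into one small helper. This is only a sketch, assuming Python 3; the name to_ascii and the shortened sample string are made up for illustration and are not part of the answer.

```python
def to_ascii(text):
    """Drop every non-ASCII character from text (hypothetical helper name)."""
    # errors='ignore' silently skips characters that ASCII cannot represent
    return text.encode('ascii', 'ignore').decode('ascii')


# Illustrative stand-in for a mixed row: English words plus two Devanagari letters
sample = 'Guitar Night \u0939\u0930 #GuitarNight'
print(to_ascii(sample))
```

Note that the spaces around a removed word stay in place, so a row made up entirely of non-ASCII words comes back as whitespace only.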
Upvotes: 2
Reputation: 4417
You could use pandas to read the CSV file into a DataFrame, using:
import pandas as pd
df = pd.read_csv(fileName, converters={COLUMN_NUMBER: func})
where func is a function that takes a single string and removes special characters. This can be done in different ways, including with a regex, but here is a simple one:
import string
def func(strg):
    # keep only digits, ASCII letters, punctuation and the space character
    return ''.join(c for c in strg if c in string.printable[:-5])
Alternatively, you can read the DataFrame first and then use apply to alter the description column, i.e.:
import pandas as pd
df = pd.read_csv(fileName)
df['description'] = df['description'].apply(func)
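or using a regex (pass regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default; note that this pattern also removes digits and punctuation):
df['description'] = df['description'].str.replace('[^A-Za-z _]', '', regex=True)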
Here string.printable[:-5] is the set of characters '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~ '
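The two filters in this answer do not keep the same characters, which may matter for the bag of words later on. The sketch below compares them on a made-up sample string using only the standard library; the names keep_printable and keep_letters are hypothetical and introduced here for illustration.

```python
import re
import string

# Digits, ASCII letters, punctuation and the space character
ALLOWED = set(string.printable[:-5])


def keep_printable(text):
    """Character filter based on string.printable[:-5]."""
    return ''.join(c for c in text if c in ALLOWED)


def keep_letters(text):
    """Regex filter: keeps only ASCII letters, spaces and underscores."""
    return re.sub(r'[^A-Za-z _]', '', text)


# Illustrative sample: English words, a date, a hashtag and two Devanagari letters
sample = 'Guitar Night, July 19 \u0939\u0930 #Party'
print(keep_printable(sample).split())  # punctuation and digits survive
print(keep_letters(sample).split())    # only the letter-based words survive
```

If hashtags or dates should count as tokens, the printable-based filter is probably the safer choice; if only plain words are wanted, the regex variant is stricter.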
Upvotes: 0
Reputation: 131
You can do this using a regex, assuming you have already extracted the text from the CSV file:
#python 2.7
import re
text = "Something with special characters á┬ñ┬╡├á┬ñ┬░├á┬Ñ┬ì├á┬ñ┬╖"
cleaned_text = re.sub(r'[^\x00-\x7f]+','', text)
print(cleaned_text)  # the parenthesized form works on both Python 2.7 and Python 3
Output - Something with special characters
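The pattern [^\x00-\x7f]+ matches any run of one or more characters outside the ASCII range (code points 0 to 127), and re.sub replaces each such run with an empty string.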
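For a bag of words it may also help to tidy up the whitespace left behind and to notice rows that end up empty, such as the all-Hindi row in the question. The following is a rough sketch assuming Python 3; the function name clean_description and the sample strings are made up for illustration.

```python
import re

# Same character class as above, compiled once for reuse across many rows
NON_ASCII = re.compile(r'[^\x00-\x7f]+')


def clean_description(text):
    """Remove non-ASCII runs, then collapse the leftover whitespace."""
    ascii_only = NON_ASCII.sub('', text)
    # split() with no arguments drops empty pieces, so repeated spaces vanish
    return ' '.join(ascii_only.split())


rows = [
    '\u0939\u0930 \u0935\u0930\u094d\u0937',  # entirely non-English
    'Guitar \ufe0f? Night #Party',            # mostly English
]
cleaned = [clean_description(r) for r in rows]
# Rows that come back empty carry no English words and can be skipped
usable = [c for c in cleaned if c]
print(usable)
```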
Upvotes: 2