Reputation: 31
My raw input is text containing special characters. I want to replace these characters so that no special characters remain in the strings after the code runs.
I tried to write the code below, but I am not sure whether it is right.
def avoid(x):
    #print(x)
    #value=[]
    for ele in range(0, len(x)):
        p=invalidcharch(ele)
        #value.append(p)
        #value=''.join(p)
        print(p)
    return p
def invalidcharch(e):
    items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
    for i, j in items.items():
        e = e.replace(i, j)
    return e
for col in df.columns:
    df[col]=df[col].apply(lambda x:avoid(x))
However, with the code above I am unable to store the whole string in the variable p. I need p to hold the whole string so that it can replace the cell value. The data contains mixed datatypes, such as strings and integers.
col A
Junto à Estação de Carcavelos;
Bragança
Situado en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet.
Cartão MOBI.E
R. Conselheiro Emídio Navarro (frente ao ISEL)
After change
Junto a Estacao de Carcavelos;
Braganca
Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
Cartao MOBI.E
R. Conselheiro Emidio Navarro (frente ao ISEL)
Upvotes: 1
Views: 1984
Reputation: 195408
Using the standard unicodedata module:
import unicodedata
df["col A"] = df["col A"].apply(
    lambda x: unicodedata.normalize("NFD", x)
    .encode("ascii", "ignore")
    .decode("utf-8")
)
print(df)
Prints:
col A
0 Junto a Estacao de Carcavelos;
1 Braganca
2 Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3 Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
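Since the question mentions that the data mixes strings and integers, calling unicodedata.normalize on every cell would raise a TypeError for the non-string values. Below is a minimal sketch of one way to guard against that, using only the standard library. The helper name strip_accents and the sample list are made up for illustration.

```python
import unicodedata

def strip_accents(value):
    """Return value with accents removed if it is a string, else unchanged."""
    if not isinstance(value, str):
        return value  # leave integers, floats, None, etc. untouched
    # NFD splits "ç" into "c" plus a combining cedilla; encoding to ASCII with
    # errors="ignore" then drops the combining marks (and any other non-ASCII).
    decomposed = unicodedata.normalize("NFD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")

cells = ["Bragança", "Cartão MOBI.E", 42]
print([strip_accents(c) for c in cells])
```

The same function could then be passed to apply, for example df[col].apply(strip_accents). Note that this approach silently drops characters that have no ASCII base letter, which may differ from what the question's dictionary does for some symbols.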
Upvotes: 1
Reputation: 20669
We can use Series.str.translate, which is equivalent to str.maketrans + str.translate in Python.
converter = str.maketrans(items) # `items` is special chars dict.
df['col A'].str.translate(converter)
0 Junto a Estacao de Carcavelos;
1 Braganca
2 Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3 Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
Name: col A, dtype: object
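To see what Series.str.translate does to each value, here is a small sketch of the plain-Python equivalent. The three-entry dictionary is a shortened stand-in for the question's items, and the sample text is invented. Note that when str.maketrans is given a dict, every string key must be a single character.

```python
# Shortened stand-in for the `items` dict from the question.
items = {"ç": "c", "ã": "a", "º": ""}

# Build the translation table once; each key must be a single character.
converter = str.maketrans(items)

text = "Estação nº 5"
# Each character found in the table is swapped for its value;
# a value of "" deletes the character.
print(text.translate(converter))
```

Because the table is built once and applied in a single pass over each string, this avoids looping over the dictionary and calling replace once per entry.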
Upvotes: 1
Reputation: 30
You can do that simply with the following code, where items is the special characters dictionary from the question:
for i in df.columns:
    df[i] = df[i].replace(items, regex=True)
Upvotes: 0
Reputation: 16673
Adding to Achille Huet's comment that links this question, you can use this on a pandas dataframe column like this:
import unidecode
df['col A'] = df['col A'].apply(lambda x: unidecode.unidecode(x))
OR
import unidecode
for col in df.columns:
    df[col]=df[col].apply(lambda x: unidecode.unidecode(x))
However, since you have already created the special characters dictionary, you can use it: just create a dictionary special_chars and replace the values on the entire dataframe by passing regex=True. This should also be faster. I don't know if there is a faster solution using unicode. It also depends on what you are doing with the data. If you are sending it to a .csv file, for example, I believe there is a parameter in to_csv() as well, but I am not sure if that is relevant:
special_chars = {"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"",
"ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N",
"Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O",
"ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
df = df.replace(special_chars, regex=True)  # replace() returns a new DataFrame, so assign it back
Upvotes: 2
Reputation: 9047
I don't fully understand what you are trying to achieve, but you can try something like:
import pandas as pd
items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
df = pd.DataFrame([
    'abcä',
    'Ãbcd12345'
], columns=['colA'])
# Characters found in `items` are replaced by their mapped value; any other non-ASCII character becomes '_'.
# Using get() with a default (instead of `or '_'`) keeps characters mapped to '' as deletions.
df['colA'] = df['colA'].str.replace(r'[^\x00-\x7F]', lambda x: items.get(x.group(0), '_'), regex=True)
df
colA
0 abca
1 Abcd12345
For r'[^\x00-\x7F]', check "Regular expression that finds and replaces non-ascii characters with Python".
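The pattern [^\x00-\x7F] matches any single character outside the ASCII range, and the callable decides what each match becomes. Here is a minimal sketch of the same idea with the standard re module, outside pandas. The two-entry dictionary, the function name replace_non_ascii and its fallback parameter are illustrative only.

```python
import re

items = {"ä": "a", "Ã": "A"}  # illustrative subset of the question's dict

# Matches one character whose code point is above 0x7F (i.e. non-ASCII).
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def replace_non_ascii(text, fallback="_"):
    """Replace mapped characters via `items`; unmapped ones become `fallback`."""
    return NON_ASCII.sub(lambda m: items.get(m.group(0), fallback), text)

print(replace_non_ascii("abcä"))
print(replace_non_ascii("Ãbcd12345"))
print(replace_non_ascii("naïve"))  # "ï" is not in this small dict
```

With precomposed input characters, the unmapped "ï" in the last call falls back to "_", which makes characters missing from the dictionary easy to spot in the output.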
Upvotes: 0