manuja

Reputation: 31

Replacing special characters in a string

I have raw text input whose strings contain special characters. I want to replace these characters so that no special characters remain after the code runs.

I tried the code below, but I am not sure whether it is correct.

def avoid(x):
    #print(x)
    #value=[]
    for ele in range(0, len(x)):
        p=invalidcharch(ele)
        #value.append(p)
        #value=''.join(p)
        print(p)
    return p

def invalidcharch(e):
    items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
    for i, j in items.items():
        e = e.replace(i, j)
    return e

for col in df.columns:
    df[col]=df[col].apply(lambda x:avoid(x))

However, with the code above I am unable to store the whole string in the variable p. I need p to hold the whole string so that it can replace the cell value. The data contains mixed datatypes, such as strings and integers.

col A
Junto à Estação de Carcavelos;
Bragança
Situado en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet.
Cartão MOBI.E R. Conselheiro Emídio Navarro (frente ao ISEL)

After change
Junto a Estacao de Carcavelos;
Braganca
Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)

Upvotes: 1

Views: 1984

Answers (5)

Andrej Kesely

Reputation: 195408

Using standard unicodedata module:

import unicodedata

df["col A"] = df["col A"].apply(
    lambda x: unicodedata.normalize("NFD", x)
    .encode("ascii", "ignore")
    .decode("utf-8")
)
print(df)

Prints:

                                                                      col A
0                                            Junto a Estacao de Carcavelos;
1                                                                  Braganca
2  Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3              Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
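
Since the question mentions columns with mixed datatypes, here is a minimal sketch of the same normalization wrapped in a helper that leaves non-string values alone. The name strip_accents is hypothetical and not part of the answer above.

```python
import unicodedata


def strip_accents(value):
    """Return `value` with accents removed; non-strings pass through unchanged."""
    if not isinstance(value, str):
        return value  # leave ints, floats, None and similar values as they are
    # NFD splits a character such as "ç" into "c" plus a combining cedilla ...
    decomposed = unicodedata.normalize("NFD", value)
    # ... and encoding to ASCII with errors="ignore" drops every non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Junto à Estação de Carcavelos;"))
print(strip_accents(42))
```

A helper like this could be passed to df[col].apply for each column. Note that characters without a canonical decomposition, such as "·" or "ª", are simply dropped here, whereas the dictionary in the question maps them to "-" and "a".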

Upvotes: 1

Ch3steR

Reputation: 20669

We can use Series.str.translate, which is equivalent to str.maketrans + str.translate in Python.

converter = str.maketrans(items)  # `items` is the special-characters dict from the question
df['col A'].str.translate(converter)

0                                              Junto a Estacao de Carcavelos;
1                                                                    Braganca
2    Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3                Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
Name: col A, dtype: object
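
To illustrate the plain-Python equivalent mentioned above, here is a small sketch of str.maketrans + str.translate on ordinary strings. The dictionary is an abbreviated, illustrative subset of the one in the question.

```python
# Abbreviated, illustrative subset of the question's `items` dictionary.
items = {"à": "a", "ã": "a", "ç": "c", "é": "e", "í": "i", "ó": "o", "ú": "u", "·": "-", "º": ""}

# With a single dict argument, str.maketrans requires every key to be exactly
# one character (or a code point); values may be strings of any length, including "".
converter = str.maketrans(items)

print("Bragança".translate(converter))
print("Emídio nº 3".translate(converter))
```

Because every key must be a single character, this approach fits a mapping like the one in the question, but it would not work for multi-character keys.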

Upvotes: 1

choka

Reputation: 30

You can do that simply with the following snippet:

for i in df.columns:
    df[i] = df[i].replace(items, regex=True)

Upvotes: 0

David Erickson

Reputation: 16673

Adding to Achille Huet's comment that links this question, you can use this on a pandas dataframe column like this:

import unidecode
df['col A'] = df['col A'].apply(lambda x: unidecode.unidecode(x))

OR

import unidecode
for col in df.columns:
    df[col]=df[col].apply(lambda x: unidecode.unidecode(x))

However, since you have already created the special characters dictionary, you can use it:

Create a dictionary special_chars and replace the values across the entire dataframe by passing regex=True. This should also be faster. I don't know whether there is a faster solution using unicode; it also depends on what you are doing with the result. If you are writing to a .csv file, for example, I believe there is a parameter in to_csv() as well, but I am not sure whether that is relevant:

special_chars = {"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"",
"ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N",
"Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O",
"ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}

# replace() returns a new DataFrame, so assign the result back
df = df.replace(special_chars, regex=True)
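
With regex=True the dictionary keys are interpreted as regular expression patterns, so a key that happens to be a regex metacharacter (for example "." or "?") would need escaping; none of the keys above is one. The following plain-Python sketch shows the same dictionary-driven idea with the re module. It is only an illustration, not what pandas does internally, and the names pattern and replace_special are hypothetical.

```python
import re

# Abbreviated, illustrative subset of the special-characters dictionary.
special_chars = {"á": "a", "é": "e", "ñ": "n", "ú": "u", "¿": "", "·": "-"}

# Escape each key so it is matched literally, then join the keys into one alternation.
pattern = re.compile("|".join(re.escape(key) for key in special_chars))


def replace_special(text):
    """Replace every dictionary key found in `text` in a single pass."""
    return pattern.sub(lambda match: special_chars[match.group(0)], text)


print(replace_special("Situado en el núcleo"))
```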

Upvotes: 2

Epsi95

Reputation: 9047

I have not fully understood what you are trying to achieve, but you can try something like:

items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"} 

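import pandas as pd

df = pd.DataFrame([
    'abcä',
    'Ãbcd12345'
], columns=['colA'])

# Characters missing from `items` become '_'; passing the default to .get()
# keeps characters that are mapped to '' removed instead of turning them into '_'.
df['colA'] = df['colA'].str.replace(r'[^\x00-\x7F]', lambda x: items.get(x.group(0), '_'), regex=True)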

df
    colA
0   abca
1   Abcd12345

For r'[^\x00-\x7F]', see "Regular expression that finds and replaces non-ascii characters with Python".
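
One detail worth checking in a callback like this: the dictionary maps some characters to an empty string, which is falsy in Python, so `items.get(key) or '_'` and `items.get(key, '_')` behave differently. A minimal sketch with the re module, using an abbreviated, illustrative dictionary and the hypothetical names NON_ASCII, with_or and with_default:

```python
import re

# Abbreviated, illustrative subset: "º" is mapped to an empty string on purpose.
items = {"ä": "a", "º": ""}

NON_ASCII = re.compile(r"[^\x00-\x7F]")

text = "nº 5 ä ß"

# `or` treats the empty-string mapping as if it were missing, so "º" becomes "_".
with_or = NON_ASCII.sub(lambda m: items.get(m.group(0)) or "_", text)

# A default passed to .get() is used only when the key is absent, so "º" is removed.
with_default = NON_ASCII.sub(lambda m: items.get(m.group(0), "_"), text)

print(with_or)
print(with_default)
```

In both variants "ß", which is not in the dictionary, falls back to "_".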

Upvotes: 0
