Proton

Reputation: 63

Tokenize data in Python (converting data into patterns)

I have a dataframe which is like the one below:

Name      | City

Apple     | Tokyo
Papaya    | Pune
TimGru334 | Shanghai
236577    | Delhi

I need to iterate through each value and tokenise the data in Python. To explain in detail: every non-digit character should be replaced with 'c' and every digit with 'd', so for example 'TimGru334' would become 'ccccccddd' and '236577' would become 'dddddd'.

Can someone help me out please?

P.S: I'm new to the platform, so please excuse me if I'm wrong in any manner. Thanks in advance :)

Upvotes: 4

Views: 63

Answers (2)

U13-Forward

Reputation: 71580

Use str.replace:

df['Name'] = df['Name'].str.replace(r'\D', 'c', regex=True).str.replace(r'\d', 'd', regex=True)

And now:

print(df)

Is:

        Name      City
0      ccccc     Tokyo
1     cccccc      Pune
2  ccccccddd  Shanghai
3     dddddd     Delhi

To do all columns, use @jezrael's answer, otherwise use:

df = df.apply(lambda x: x.str.replace(r'\D', 'c', regex=True).str.replace(r'\d', 'd', regex=True))
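For completeness, a minimal runnable sketch of this approach, rebuilding the example frame from the question's table. Replacing non-digits first matters: the inserted 'c' characters are themselves non-digits, so they must not be produced after the digit pass has already run.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'Name': ['Apple', 'Papaya', 'TimGru334', '236577'],
    'City': ['Tokyo', 'Pune', 'Shanghai', 'Delhi'],
})

# First map every non-digit to 'c', then every digit to 'd'
df['Name'] = (df['Name']
              .str.replace(r'\D', 'c', regex=True)
              .str.replace(r'\d', 'd', regex=True))

print(df['Name'].tolist())  # → ['ccccc', 'cccccc', 'ccccccddd', 'dddddd']
```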

Upvotes: 3

jezrael

Reputation: 862641

Use Series.replace - replace non-numeric values first, then numeric ones; the order of the patterns in the lists is important:

df['Name'] = df['Name'].replace([r'\D', r'\d'], ['c','d'], regex=True)
print (df)
        Name      City
0      ccccc     Tokyo
1     cccccc      Pune
2  ccccccddd  Shanghai
3     dddddd     Delhi

If need replace all columns:

df = df.replace([r'\D', r'\d'], ['c','d'], regex=True)
print (df)
        Name      City
0      ccccc     ccccc
1     cccccc      cccc
2  ccccccddd  cccccccc
3     dddddd     ccccc
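A self-contained sketch of this single-call variant, again rebuilding the frame from the question. The patterns are applied in list order, so non-digits become 'c' before digits become 'd'; swapping them would let r'\D' match the freshly inserted 'd' characters.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'Name': ['Apple', 'Papaya', 'TimGru334', '236577'],
    'City': ['Tokyo', 'Pune', 'Shanghai', 'Delhi'],
})

# One call handles every column of the frame at once
df = df.replace([r'\D', r'\d'], ['c', 'd'], regex=True)

print(df['City'].tolist())  # → ['ccccc', 'cccc', 'cccccccc', 'ccccc']
```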

Upvotes: 4
