Nguyen Thai Phong
Nguyen Thai Phong

Reputation: 49

Exclude unexpected characters in a column

The final output is to exclude unexpected characters in product title.

Input:


import pandas as pd

df = pd.DataFrame({'product': [['cream', 'with', 'vegetable', 'oil', '(4+2', 'free)', '170', 'g', 'almarai', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN']],
                   })

Expected output
1. To keep text and number only.
2. To drop everything else

creamwithvegetableoil42free170galmarai

Thanks!

Upvotes: 0

Views: 25

Answers (1)

Erfan
Erfan

Reputation: 42916

First we use Series.explode to unnest your list to a row for each element

Then we can use Series.str.extractall with \w+ and \d+ to get all word characters + numbers out of your column.

Finally we use GroupBy.agg with ''.join to concatenate your strings back to each other:

(df['product'].explode()
 .str.extractall('(\w+|\d+)')
 .replace('NaN', '')
 .groupby(level=0).agg(''.join))

                                        0
0  creamwithvegetableoil42free170galmarai

Upvotes: 1

Related Questions