shiv_90
shiv_90

Reputation: 1097

How to extract non-numeric special character values in a large Series object?

This is an extended question to this post.

I'm working on a large dataset and it seems to be very messy containing random strings here and there with no order. I would like to check on these values column wise without printing the numeric values of those columns. For example if df.Col is a float object but has strings such as #$, ^%#, etc., is there a way to view/print only those?

EDIT:

For example I have a column Client_Income.

print(str(df["Client_Income"].name).upper())
print("No. of unique values:", df["Client_Income"].nunique())
print(df["Client_Income"].unique())

CLIENT_INCOME
No. of unique values: 1217
['6750' '20250' '18000' ... '13140' '9764.1' '12840.75']

While this shows that the column could be purely float deliberately converted to object, when I do .astype("float16") for example, I get the error Cannot convert to float: '$'. There are other columns that contains values such as 'XNA', '##', etc.

Upvotes: 1

Views: 42

Answers (1)

jezrael
jezrael

Reputation: 863166

Use to_numeric with errors='coerce' for convert values to numeric and if failed is created NaNs, so then filter by NaN in converted Series and non NaNs by original column:

df = pd.DataFrame({
         'Client_Income':['6750', '20.250', '18000','XNA', '##45', '#4.5#', np.nan],

})


s = pd.to_numeric(df['Client_Income'], errors='coerce')
out1 = df.loc[s.isna() & df['Client_Income'].notna(), 'Client_Income']
print (out1)
3      XNA
4     ##45
5    #4.5#
Name: Client_Income, dtype: object

Upvotes: 1

Related Questions