windyvation
windyvation

Reputation: 551

Convert Numbers that show as strings in Pandas, while ignoring other Strings that need to be kept the same

I have data where the dimension is inconsistent between numeric and string formats, and some of the numeric formats even show up as a string.

Unfortunately astype() and to_numeric() won't work for me here, as you'll see why with the sample data:

import pandas as pd

data = {'col_a': [1, '1', 2, '2', 'non_numerical'], 'col_b': [1, 2, 3, 4, 5]}

df = pd.DataFrame(data, columns=['col_a', 'col_b'])

I would like to aggregate the data, like this:

col_a col_b
1    3
2    7
'non_numerical'    5

I'm thinking that I need to write a for loop with an if else statement to check and convert data types, but I'm hoping there's a simpler way with python and/or pandas.

Thanks!

Upvotes: 1

Views: 1263

Answers (3)

SeaBean
SeaBean

Reputation: 23217

You can firstly use .assign() to create a temporary column of col_a with conversion to string type and with the same column name. Use this new temporary column for .groupby() and aggregation with .sum(), as follows:

df.assign(col_a=df['col_a'].astype(str)).groupby('col_a', as_index=False).sum()

In this way, you can get the aggregation result without modifying the original contents of col_a since we only convert the column to string on the temporary column without touching the original column.

Result:

           col_a  col_b
0              1      3
1              2      7
2  non_numerical      5

Upvotes: 1

anky
anky

Reputation: 75080

to_numeric should work with coercing + fillna:

out = (df.assign(col_a=pd.to_numeric(df['col_a'],errors='coerce').fillna(df['col_a']))
         .groupby("col_a",as_index=False).sum())

print(out)

           col_a  col_b
0              1      3
1              2      7
2  non_numerical      5

Upvotes: 2

scotscotmcc
scotscotmcc

Reputation: 3113

You can accomplish this by making col_a all strings instead of trying to make them numeric. If you wanted to later add values in col_a or something, then of course you wouldn't be able to.

df['col_a2'] = df['col_a'].astype(str)
pd.pivot_table(df,index='col_a2',values='col_b',aggfunc='sum')

col_a2  col_b
1   3
2   7
non_numerical   5

If you didn't want to make col_a2, just re-write over col_a by not putting the 2 there.

Upvotes: 1

Related Questions