Reputation: 551
I have data where the dimension is inconsistent between numeric and string formats, and some of the numeric formats even show up as a string.
Unfortunately astype()
and to_numeric()
won't work for me here, as you'll see why with the sample data:
import pandas as pd
data = {'col_a': [1, '1', 2, '2', 'non_numerical'], 'col_b': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data, columns=['col_a', 'col_b'])
I would like to aggregate the data, like this:
col_a col_b
1 3
2 7
'non_numerical' 5
I'm thinking that I need to write a for loop with an if else statement to check and convert data types, but I'm hoping there's a simpler way with python and/or pandas.
Thanks!
Upvotes: 1
Views: 1263
Reputation: 23217
You can firstly use .assign()
to create a temporary column of col_a
with conversion to string type and with the same column name. Use this new temporary column for .groupby()
and aggregation with .sum()
, as follows:
df.assign(col_a=df['col_a'].astype(str)).groupby('col_a', as_index=False).sum()
In this way, you can get the aggregation result without modifying the original contents of col_a
since we only convert the column to string on the temporary column without touching the original column.
Result:
col_a col_b
0 1 3
1 2 7
2 non_numerical 5
Upvotes: 1
Reputation: 75080
to_numeric
should work with coercing + fillna:
out = (df.assign(col_a=pd.to_numeric(df['col_a'],errors='coerce').fillna(df['col_a']))
.groupby("col_a",as_index=False).sum())
print(out)
col_a col_b
0 1 3
1 2 7
2 non_numerical 5
Upvotes: 2
Reputation: 3113
You can accomplish this by making col_a
all strings instead of trying to make them numeric. If you wanted to later add values in col_a or something, then of course you wouldn't be able to.
df['col_a2'] = df['col_a'].astype(str)
pd.pivot_table(df,index='col_a2',values='col_b',aggfunc='sum')
col_a2 col_b
1 3
2 7
non_numerical 5
If you didn't want to make col_a2
, just re-write over col_a
by not putting the 2 there.
Upvotes: 1