Reputation: 1
I have a large dataset with 400columns and 30,000 rows. The dataset is all numerical but some columns have weird string values in them (denoted as "#?") instead of being blank. This changes the dtypes of the columns that have "#?" into object type. (150 columns object dtype)
I need to convert all the columns into float or int dtypes, and then fill the normal NaN values in the data, with means of a column's groups. (e.g: means of X, means of Y in each column)
col1 col2 col3
X 21 32
X NaN 3
Y Nan 5
My end goal is to apply this to the entire data:
df.groupby("col1").transform(lambda x: x.fillna(x.mean()))
But I can't apply this for the columns that have "#?" in them, they get dropped. I tried replacing the #? with a numerical value, and then convert all the columns into float dtype, which works, but the replaced values also should be included in the above code.
I thought about replacing #? with an weird value like -123.456 so that it doesn't get mixed with actual data points, and maybe replace all the -123.456 with the means of column groups but the -123.456 would need to be excluded from the mean. But I just don't know how that would even work. If I convert it back to NaN again, the dtype changes back to object.
I think the best way to go about it would be directly replacing the #? with the column group means.
Any ideas?
edit: I'm so dumb lol
df=df.replace('#?', '').astype(float, errors = 'ignore')
this works.
Upvotes: 0
Views: 50
Reputation: 862601
Use:
print (df)
col1 col2 col3
0 X 21 32
1 X #? 3
2 Y NaN 5
df = (df.set_index('col1')
.replace(r'#\?', np.nan, regex=True)
.astype(float)
.groupby("col1")
.transform(lambda x: x.fillna(x.mean())))
print (df)
col2 col3
col1
X 21.0 32.0
X 21.0 3.0
Y NaN 5.0
Upvotes: 1