Reputation: 1984
I'm reading in a very large CSV with over 200 columns. Some of the columns are completely empty. When I read this as a dataframe, it forces these columns to be type float64.
I force it to be a string with:
if df['OtherValidationAuthority5ValidationAuthorityEntityID'].dtype == 'float64':
    df['OtherValidationAuthority5ValidationAuthorityEntityID'] = df['OtherValidationAuthority5ValidationAuthorityEntityID'].astype(str)
The problem then is that when I print out that column, all the values are nan. They need to be null strings, so I use
df['OtherValidationAuthority5ValidationAuthorityEntityID'] = df['OtherValidationAuthority5ValidationAuthorityEntityID'].replace(np.nan, '', regex=True)
But when I then print the column, the values are still nan! I've seen other examples on Stack Overflow, and it SEEMS like this is what they recommend. Has something changed between versions? (I'm using 3.7.)
What am I missing?
Addendum. I use this code to change the columns. It works for some, but not others.
for colname in df.columns:
    if df[colname].dtype == 'float64' or df[colname].dtype == 'int64':
        df[colname] = df[colname].astype(str)
        df[colname] = df[colname].replace({'nan': ''})
When I print dtypes, they're all 'object', as I expect, but when I print the values they're
('001GPB6A9XPE8XJICC14', 'FIDELITY ADVISOR SERIES I - Fidelity Advisor Leveraged Company Stock Fund', nan, nan, nan, nan, nan, '', nan, nan, '', nan, nan, '', nan, '', '', '', nan, nan, nan, '', '', '', '', '', '', '', '', '', '', '', '', nan, '245 SUMMER STREET', '', nan, '', nan, nan, nan, 'BOSTON', 'US-MA', 'US', '02110', nan, '245 Summer Street', '', nan, '', nan, nan, nan, 'Boston', 'US-MA', 'US', '02210', nan, nan, nan, '', '', '', nan, '', '', nan, nan, nan, nan, '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'RA000665', nan, 'S000005113', 'US-MA', 'FUND', '8888', 'OTHER', nan, nan, '', '', 'ACTIVE', nan, nan, nan, nan, '', '2012-11-29T16:33:00.000Z', '2020-06-03T14:33:00.000Z', 'ISSUED', '2021-05-29T07:50:00.000Z', 'EVK05KS7XY1DEII3R011', 'FULLY_CORROBORATED', 'RA000665', nan, 'S000005113', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')
Upvotes: 0
Views: 727
Reputation: 323226
Change the replace line to:
df['xxxx'] = df['xxxx'].replace({'nan': ''})
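As for the addendum, where the loop works for some columns but not others: a likely cause is that the columns still showing nan were already dtype object (text with some missing values), so the float64/int64 check skips them and their real NaN values are never touched. Here is a minimal sketch of two ways around that. The tiny CSV and the names csv_text, cleaned and as_text are made up for illustration, and I'm assuming empty strings are acceptable in every column.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV: "empty" has no values at all,
# "partial" is a text column with one missing value, "count" is numeric.
csv_text = "id,empty,partial,count\nA1,,x,1\nB2,,,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# Option 1: fix the frame after reading. Fill the real NaN values first and
# convert afterwards, for every column regardless of dtype, so object
# columns are covered and no NaN is left to turn into the text 'nan'.
cleaned = df.copy()
for colname in cleaned.columns:
    cleaned[colname] = cleaned[colname].fillna("").astype(str)

# Option 2: never create NaN in the first place. dtype=str reads every
# column as text, and keep_default_na=False keeps empty fields as ''.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(cleaned.iloc[1].tolist())
print(as_text.iloc[1].tolist())
```

Option 2 is probably the simpler choice if everything should end up as text anyway. Note that Option 1 converts numeric columns from their parsed values, so a float column that does contain numbers will come out as text like '1.0'.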
Upvotes: 1