Reputation:
Have this data:
region gdp_per_capita
0 Coasts of USA 71 546
1 USA: New York, New Jersey 81 615
2 USA: California 74 205
3 USA: New England 74 000
Wanna get this:
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
Tried to use df.columns = df.columns.str.replace(' ', '')
, but it did not work
Upvotes: 2
Views: 3107
Reputation: 885
import re
df['gdp_per_capita'] = df['gdp_per_capita'].apply(lambda x: re.sub("[^0-9]", "", str(x))).astype(int)
Upvotes: 1
Reputation: 7604
Just this should do:
df['gdp_per_capita'] = df['gdp_per_capita'].astype(str).str.replace('\s+', '').replace('nan', np.nan)
df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'])
print(df)
region gdp_per_capita
0 Coasts of USA 71546
1 USA: New York, New Jersey 81615
2 USA: California 74205
3 USA: New England 74000
Upvotes: 5
Reputation: 59294
Looks like you want to work with numbers rather than strings.
Hence, replacing ' '
with ''
and using pd.to_numeric
seems like an easy and solid approach.
Let me suggest another one which might or might not be good (it depends on your dataset).
If the thousands in your dataset are separated by a whitespace (' '
), you can just read your df
as
df = pd.read_csv(file, thousands = ' ')
and all your columns with 74 109
would be read as 74109
and dtype
integer or float.
Upvotes: 2
Reputation:
I am not quite sure it will work or not but try the following:
Let me know if it works :)
Upvotes: 0