Reputation: 199
Some of the numeric strings in my DataFrames contain commas, which represent either the decimal mark or the thousands separator; others contain no commas at all. The DataFrame below is an example of the range of price formats I receive via an API, which vary depending on the currency. These are prices, so there will always be 2 decimals. I need to convert the string prices to float so I can sum them, separate them into other DataFrames, or use them for plotting graphs. I have created a loop to replace them, but is there a quicker way to do this without the loop?
My DataFrame and working loop are as follows:
import numpy as np
import pandas as pd

data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
fees = []
sales = []
for items in df['amount']:
    if items[-7:-6] == ',':
        # comma is a thousands separator: drop it
        items = items.replace(',', '').replace(' ', '')
    elif items[-3:-2] == ',':
        # comma is the decimal mark: turn it into a dot
        items = items.replace(',', '.').replace(' ', '')
    # convert once, after the string clean-up is finished
    items = float(items)
    if items <= 0:
        fees.append(items)
    else:
        sales.append(items)
I have attempted to do this without the loop but can't seem to work out where I have gone wrong.
df["amount"] = np.where((df['amount'][-7:-6] == ','),
df["amount"][-7:-6].str.replace(',', '').replace(' ',''),
df["amount"])
df["amount"] = np.where((df['amount'][-3:-2] == ','),
df["amount"][-3:-2].str.replace(',', '').replace(' ',''),
df["amount"])
Any help would be much appreciated. Thank you in advance
Upvotes: 2
Views: 4306
Reputation: 1640
Since you mention that the last two digits are always the decimals, the ',' in front of them needs to be replaced with '.' to make the value convertible to float. But you also have values like 1,000.00, which would become invalid if every ',' were replaced with '.', so you can use a regex to identify which commas to replace:
data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
df
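First, the regex matches every string that ends with a ',' followed by two digits; the replace function then replaces the match with a '.' followed by the captured digits (the 99 from ,99):
df['amount'] = df['amount'].str.replace(r'(,)(\d{2}$)', r'.\2', regex=True)
# in r'.\2', `\2` refers to the second captured group of the regex (the two digits)
# regex=True is required in pandas 2.0 and later, where str.replace defaults to literal matching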
Then, to convert 1,000.00 to float, we replace the remaining ',' with an empty string:
df['amount'] = df['amount'].str.replace(',','')
And then convert the data type to float
df['amount'] = df['amount'].astype(float)
print(df)
    amount
0     7.99
1     6.99
2     9.99
3    -6.99
4  1000.00
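The steps above can also be chained into a single vectorized pass. What follows is a minimal sketch, not a definitive implementation: it assumes every price has exactly two decimals (as the question states), passes `regex=True` explicitly so it behaves the same on pandas 2.x, and uses a hypothetical helper name, `to_float_prices`. The `fees`/`sales` split at the end mirrors the loop in the question.

```python
import pandas as pd


def to_float_prices(amounts: pd.Series) -> pd.Series:
    """Convert price strings such as '7,99' or '1,000.00' to floats.

    Assumes exactly two decimal digits, as stated in the question.
    """
    cleaned = (
        amounts.str.replace(" ", "", regex=False)
        # A comma followed by exactly two digits at the end is a decimal comma.
        .str.replace(r",(\d{2})$", r".\1", regex=True)
        # Any comma that is left over is a thousands separator.
        .str.replace(",", "", regex=False)
    )
    return cleaned.astype(float)


df = pd.DataFrame({"amount": ["7,99", "6,99", "9.99", "-6,99", "1,000.00"]})
df["amount"] = to_float_prices(df["amount"])

# Boolean masks replace the if/else branch of the question's loop.
fees = df.loc[df["amount"] <= 0, "amount"].tolist()
sales = df.loc[df["amount"] > 0, "amount"].tolist()
```

This sketch does not handle a dot used as the thousands separator (for example '1.000,00'); that format would need an extra rule.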
Upvotes: 2
Reputation: 10960
Try using split and join:
df.amount.str.split(',').str.join('').astype(float)
Output
0 799.00
1 699.00
2 9.99
3 -699.00
4 1000.00
Name: amount, dtype: float64
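Note that this removes every comma, so it is only correct where the comma is a thousands separator; a decimal comma such as '7,99' becomes 799.0, as the output above shows. The following is a hedged sketch of a check that flags such values before converting. The variable names (`looks_decimal_comma`, `naive`, `safe_rows`) are hypothetical, and the rule "comma plus exactly two digits at the end" relies on the question's statement that prices always have two decimals.

```python
import pandas as pd

amounts = pd.Series(["7,99", "6,99", "9.99", "-6,99", "1,000.00"], name="amount")

# True where the string ends in a comma plus exactly two digits,
# i.e. where the comma is acting as the decimal mark.
looks_decimal_comma = amounts.str.contains(r",\d{2}$", regex=True)

# Removing all commas only gives the right value where the mask is False.
naive = amounts.str.split(",").str.join("").astype(float)
safe_rows = naive[~looks_decimal_comma]
```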
Upvotes: 0
Reputation: 1281
You can use lambdas instead of numpy:
lambda1 = lambda items: float(str(items).replace(',', '').replace(' ','')) if str(items)[-7:-6] == ',' else items
lambda2 = lambda items: float(str(items).replace(',', '.').replace(' ','')) if str(items)[-3:-2] == ',' else items
to_float = lambda items: float(items)
df['amount_clean'] = df["amount"].map(lambda1).map(lambda2).map(to_float)
=========================================================================
Edit: what are lambdas
In Python, lambda functions are small anonymous functions consisting of a single expression (see https://www.w3schools.com/python/python_lambda.asp)
Example with condition:
lambda x: x + 1 if x < 0 else x
This is equivalent to:
def my_lambda_function(x):
    if x < 0:
        return x + 1
    else:
        return x
When passed to the column of a pandas DataFrame via the map function, the lambda expression will be applied to the value in each row of the column.
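As a small illustration of that last point, the sketch below maps the example lambda over a made-up Series; the values are invented purely for demonstration.

```python
import pandas as pd

values = pd.Series([-2, 0, 3])

# The lambda is called once per element; only negative values are incremented.
result = values.map(lambda x: x + 1 if x < 0 else x)

print(result.tolist())  # [-1, 0, 3]
```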
Hope this helps!
Upvotes: 1