Reputation: 373
I have a dataframe which contains negative numbers, with accountancy notation i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in ()
are negative. How can I replace them to have minus values instead i.e. (5)
becomes -5
and so on.
Here is what I have tried:
sales = (
df
.select('sales')
.withColumn('sales_new',
sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|salees |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.
I have tried using regexp_replace
but get an error that:
PatternSyntaxException: Unclosed group near index 1
sales = (
df
.select('sales')
.withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
Upvotes: 0
Views: 416
Reputation: 373
This can be solved with a case statement and regular expression together:
from pyspark.sql.functions import regexp_replace, col
sales = (
df
.select('sales')
.withColumn('sales_new', sf.when(sf.col('sales').substr(1,1) == '(',
sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), '\(|\)', '')))
.otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
Upvotes: 2
Reputation: 2066
You can slice the string from the second character to the second last character, and then convert it to float, for example:
def convert(number):
try:
number = float(number)
except:
number = number[1:-1]
number = float(number)
return number
You can iterate through all the elements and apply this function.
Upvotes: 0