cs_guy
cs_guy

Reputation: 373

Replace accounting notation for negative number with minus value

I have a dataframe which contains negative numbers, with accountancy notation i.e.:

df.select('sales').distinct().show()

+------------+
|    sales   |
+------------+
|         18 |
|          3 |
|         10 |
|         (5)|
|          4 |
|         40 |
|          0 |
|          8 |
|         16 |
|         (2)|
|          2 |
|         (1)|
|         14 |
|         (3)|
|          9 |
|         19 |
|         (6)|
|          1 |
|         (9)|
|         (4)|
+------------+
only showing top 20 rows

The numbers wrapped in () are negative. How can I replace them to have minus values instead i.e. (5) becomes -5 and so on.


Here is what I have tried:

sales = (
    df
    .select('sales')
    .withColumn('sales_new',
               sf.when(sf.col('sales').substr(1,1) == '(',
                       sf.concat(sf.lit('-'), sf.col('sales').substr(2,3)))
               .otherwise(sf.col('sales')))
    
)

sales.show(20,False)

+---------+---------+
|salees   |sales_new|
+---------+---------+
| 151     | 151     |
| 134     | 134     |
| 151     | 151     |
|(151)    |-151     |
|(134)    |-134     |
|(151)    |-151     |
| 151     | 151     |
| 50      | 50      |
| 101     | 101     |
| 134     | 134     |
|(134)    |-134     |
| 46      | 46      |
| 151     | 151     |
| 134     | 134     |
| 185     | 185     |
| 84      | 84      |
| 188     | 188     |
|(94)     |-94)     |
| 38      | 38      |
| 21      | 21      |
+---------+---------+

The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.


I have tried using regexp_replace but get an error that:

PatternSyntaxException: Unclosed group near index 1

sales = (
    df
    .select('sales')
    .withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)

Upvotes: 0

Views: 416

Answers (2)

cs_guy
cs_guy

Reputation: 373

This can be solved with a case statement and regular expression together:

from pyspark.sql.functions import regexp_replace, col

sales = (
    df
    .select('sales')
    .withColumn('sales_new', sf.when(sf.col('sales').substr(1,1) == '(',
                sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), '\(|\)', '')))
                .otherwise(sf.col('sales')))
)

sales.show(20,False)

+---------+---------+
|sales    |sales_new|
+---------+---------+
|151      |151      |
|134      |134      |
|151      |151      |
|(151)    |-151     |
|(134)    |-134     |
|(151)    |-151     |
|151      |151      |
|50       |50       |
|101      |101      |
|134      |134      |
|(134)    |-134     |
|46       |46       |
|151      |151      |
|134      |134      |
|185      |185      |
|84       |84       |
|188      |188      |
|(94)     |-94      |
|38       |38       |
|21       |21       |
+---------+---------+

Upvotes: 2

MD Mushfirat Mohaimin
MD Mushfirat Mohaimin

Reputation: 2066

You can slice the string from the second character to the second last character, and then convert it to float, for example:

def convert(number):
    try:
        number = float(number)
    except:
        
        number = number[1:-1]
        number = float(number)
        return number

You can iterate through all the elements and apply this function.

Upvotes: 0

Related Questions