Reputation: 2016
Any ideas on this one in PySpark?
I have salaries like the ones below in the Salary column. I've tried to remove the $ with regexp_replace:
df = df.withColumn('clean_salary', regexp_replace(col("Salary"), '$', ''))
df.show()
It doesn't do anything, as you can see in the output below. Any ideas why?
Thanks
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| id|first_name| last_name|gender| City| Job Title| Salary| Latitude| Longitude|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+------------+
| 1| Melinde| Shilburne|Female| Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184| $57438.18|
| 2| Kimberly|Von Welden|Female| Bulgan| Programmer II|$62846.60|48.8231572|103.5218199| $62846.60|
| 3| Alvera| Di Boldi|Female| null| null|$57576.52|39.9947462|116.3397725| $57576.52|
| 4| Shannon| O'Griffin| Male| Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171| $61489.23|
| 5| Sherwood| Macieja| Male| Mytishchi| VP Sales|$63863.09| null| 37.6489954| $63863.09|
| 6| Maris| Folk|Female|Kinsealy-Drinan| Civil Engineer|$30101.16|53.4266145| -6.1644997| $30101.16|
| 7| Masha| Divers|Female| Dachun| null|$25090.87| 24.879416| 118.930111| $25090.87|
| 8| Goddart| Flear| Male| Trélissac|Desktop Support T...|$46116.36|45.1905186| 0.7423124| $46116.36|
| 9| Roth|O'Cannavan| Male| Heitan|VP Product Manage...|$73697.10| 32.027934| 106.657113| $73697.10|
Upvotes: 0
Views: 476
Reputation: 11244
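Rather than using a regex, it's easier to just drop the first character, as long as every value in the salary column starts with a single currency symbol. substr(2, 100) keeps up to 100 characters starting at position 2 (positions are 1-based), and the cast then turns the remaining text into a number.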
>>> df = sc.parallelize([('$123',),('$873',)]).toDF(['salary'])
>>> df.show()
+------+
|salary|
+------+
| $123|
| $873|
+------+
>>> df.select(df.salary.substr(2,100).cast('float').alias('salary')).show() #Float
+------+
|salary|
+------+
| 123.0|
| 873.0|
+------+
>>> df.select(df.salary.substr(2,100).cast('decimal(10,2)').alias('salary')).show() #Decimal
+------+
|salary|
+------+
|123.00|
|873.00|
+------+
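If the values are not that straightforward, dropping the first character is no longer enough. Here is a minimal plain-Python sketch of a more tolerant cleanup, assuming the values may also carry thousands separators or stray whitespace (the input " $1,234.50" and the helper name clean_salary are made up for illustration, they are not from the question's data):

```python
import re
from decimal import Decimal


def clean_salary(raw):
    """Hypothetical helper: strip '$' and ',' and parse what is left as a Decimal."""
    if raw is None:
        # Keep missing values as missing instead of raising.
        return None
    # Inside a character class, '$' is a literal dollar sign, not an anchor.
    return Decimal(re.sub(r"[$,]", "", raw.strip()))


print(clean_salary("$57438.18"))
print(clean_salary(" $1,234.50"))
```

In Spark, the same character class should work as the pattern argument of regexp_replace, followed by a cast to decimal(10,2) as in the snippet above.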
Upvotes: 1
Reputation: 5103
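In a regex, an unescaped $ is an anchor that matches the end of the string, not a literal dollar sign, so your pattern only ever replaces an empty match. Escape it or put it in a character class; try the regexp_replace code below (a raw string keeps Python from treating \$ as a string escape):
from pyspark.sql.functions import col, regexp_replace
updatedDF = df.withColumn('clean_salary', regexp_replace(col("Salary"), r"[\$]", ""))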
updatedDF.show()
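To see why the original pattern did nothing, here is a small sketch using Python's own re module rather than Spark. Spark's regexp_replace uses Java regular expressions, but the meaning of $ as an end-of-string anchor is the same in both, so this is only meant as an illustration of the behaviour:

```python
import re

salary = "$57438.18"

# Unescaped: '$' anchors at the end of the string, so only an empty match is replaced.
unescaped = re.sub("$", "", salary)

# Escaped, or inside a character class: matches a literal dollar sign.
escaped = re.sub(r"\$", "", salary)
char_class = re.sub("[$]", "", salary)

print(unescaped)   # $57438.18
print(escaped)     # 57438.18
print(char_class)  # 57438.18
```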
Upvotes: 0