Marcos Dias

Reputation: 450

How to split the text in a pyspark column using a delimiter?

I have a column in my pyspark dataframe that contains the price of my products and the currency they are sold in. Since 99% of the products are sold in dollars, let's use the dollar example.

products_price
+---------------+-----------+
| product_id    | price     |
+---------------+-----------+
| 0001          | 10|USD    |
| 0002          | 19.9|USD  |
| 0003          | 14.45|USD |
| 0004          | 17.75|USD |
| 0005          | 98.99|USD |
| 0006          | 5.60|USD  |
| 0007          | 20.50|USD |
+---------------+-----------+

I tried a couple of things like this:

from pyspark.sql.functions import split, col

products_price = (
    products_price
    .withColumn("new_price", split(col("price"), "|").getItem(0))
)

But nothing works. The snippet above just returns the first character of the price column. It's weird because some people said it worked. I just need to remove the |USD and keep the numbers. Could you guys please help me with this?

Upvotes: 1

Views: 6712

Answers (1)

walking

Reputation: 960

The `split` function treats its second argument as a regular expression, and `|` is a special character in regex (it means alternation), so an unescaped `|` matches the empty string between every character. Escape it: `split("price", r"\|")` gives you what you need.

Upvotes: 1

Related Questions