user3476463
user3476463

Reputation: 4575

get first numeric values from pyspark dataframe string column into new column

I have a pyspark dataframe like the input data below. I would like to create a new column product1_num that parses the first numeric in each record in the productname column, in to a new column. I have example output data below. I'm not sure what's available in pyspark as far as string split and regex matching. Can anyone suggest how to do this with pyspark?

input data:

+------+-------------------+
|id    |productname        |
+------+-------------------+
|234832|EXTREME BERRY SAUCE|
|419836|BLUE KOSHER SAUCE  |
|350022|GUAVA (1G)         |
|123213|GUAVA 1G           |
+------+-------------------+

output:

+------+-------------------+-------------+
|id    |productname        |product1_num |
+------+-------------------+-------------+
|234832|EXTREME BERRY SAUCE|             |
|419836|BLUE KOSHER SAUCE  |             |
|350022|GUAVA (1G)         |1            |
|123213|GUAVA G5           |5            |
|125513|3GULA G5           |3            |
|127143|GUAVA G50          |50           |
|124513|LAAVA C2L5         |2            |
+------+-------------------+-------------+

Upvotes: 1

Views: 1582

Answers (1)

murtihash
murtihash

Reputation: 8410

You can use regexp_extract:

from pyspark.sql import functions as F
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)",1)).show()

+------+-------------------+------------+
|    id|        productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE|            |
|419836|  BLUE KOSHER SAUCE|            |
|350022|         GUAVA (1G)|           1|
|123213|           GUAVA G5|           5|
|125513|           3GULA G5|           3|
|127143|          GUAVA G50|          50|
|124513|         LAAVA C2L5|           2|
+------+-------------------+------------+

Upvotes: 1

Related Questions