kasturi76
kasturi76

Reputation: 35

How to extract the numeric part from a string column in spark?

I am new to spark and trying to play with data to get practice. I am using databricks in scala and for dataset I am using fifa 19 complete player dataset from kaggle. one of the column named "Weight" which contains the data that looks like

    +------+
    |Weight|
    +------+
    |136lbs|
    |156lbs|
    |136lbs|
    |...   |
    |...   |
    +------+

I want to change the column in such a way to look like this

    +------+
    |Weight|
    +------+
    |136   |
    |156   |
    |136   |
    |...   |
    |...   |
    +------+

Can any one help how I can change the column values in spark sql.

Upvotes: 1

Views: 9506

Answers (2)

abiratsis
abiratsis

Reputation: 7316

Here is another way using regex and the regexp_extract build-in function:

import org.apache.spark.sql.functions.regexp_extract

val df = Seq(
"136lbs",
"150lbs",
"12lbs",
"30kg",
"500kg")
.toDF("weight")

df.withColumn("weight_num", regexp_extract($"weight", "\\d+", 0))
  .withColumn("weight_unit", regexp_extract($"weight", "[a-z]+", 0))
  .show

//Output
+------+----------+-----------+
|weight|weight_num|weight_unit|
+------+----------+-----------+
|136lbs|       136|        lbs|
|150lbs|       150|        lbs|
| 12lbs|        12|        lbs|
|  30kg|        30|         kg|
| 500kg|       500|         kg|
+------+----------+-----------+

Upvotes: 4

uh_big_mike_boi
uh_big_mike_boi

Reputation: 3470

You can create a new column and use regexp_replace

dataFrame.withColumn("Weight2", regexp_replace($"Weight" , lit("lbs"), lit("")))

Upvotes: 1

Related Questions