Reputation: 47
I'am a writing a program to fix database column values (CSV input) with numbers joined from varios datasources.
The problem is that they have both thousands separator and decimal separator using a dot "."
Eg:
I've tried with conditional regex, but I did not found the proper way, so I had to do it in two stages ('when', and regex extract):
import sys
from pyspark import SQLContext, SparkContext
from pyspark.sql.types import StructType,StructField,StringType,DateType,IntegerType
sc = SparkContext()
sqlContext = SQLContext(sc)
from pyspark.sql.functions import regexp_replace, regexp_extract, col, when
df3 = sqlContext.createDataFrame([["100.000.000"],["9.000"],["0.600000"],["12.000.000"],["0.200000"],["1.000"],["700.000"]],["SENSOR_VALUE"])
df3.show()
regexDotDig6 = "\.(\d\d\d\d\d\d)";
regexDotDig3 = "\.(\d\d\d)";
df3 = df3.withColumn("SENSOR_VALUE", when(regexp_extract(df3.SENSOR_VALUE, regexDotDig6,1) == "", regexp_replace(df3.SENSOR_VALUE, regexDotDig3, "$1")) .otherwise(df3.SENSOR_VALUE).cast('double'))
df3.show()
Is there a way to do it in the same regex, or in more efficient way?
Upvotes: 0
Views: 940
Reputation: 43504
You can use pyspark.sql.Column.startswith
to check if the string starts with a "0"
. If yes, cast it to a double
. If no, replace the periods with empty string and then cast.
from pyspark.sql.functions import col, when, regexp_replace
df3.withColumn(
"SENSOR_VALUE",
when(
col("SENSOR_VALUE").startswith("0"),
col("SENSOR_VALUE")
).otherwise(regexp_replace("SENSOR_VALUE", "\.", "")).cast("double")
).show()
#+------------+
#|SENSOR_VALUE|
#+------------+
#| 1.0E8|
#| 9000.0|
#| 0.6|
#| 1.2E7|
#| 0.2|
#| 1000.0|
#| 700000.0|
#+------------+
Upvotes: 1
Reputation: 47
This is not an optimal solution I managed to...
from pyspark.sql.functions import when, regexp_extract, regexp_replace
df3 = df3.withColumn(
"SENSOR_VALUE",
when(
regexp_extract(df3.SENSOR_VALUE, regexDotDig6,1) == "",
regexp_replace(df3.SENSOR_VALUE, regexDotDig3, "$1")
).otherwise(df3.SENSOR_VALUE).cast('double'))
df3.show()
+------------+
|SENSOR_VALUE|
+------------+
| 1.0E8|
| 9000.0|
| 0.6|
| 1.2E7|
| 0.2|
| 1000.0|
| 700000.0|
+------------+
Upvotes: 0