David Martin

Reputation: 47

convert string column to double with period as decimal point and thousands separator

I'm writing a program to fix database column values (CSV input) with numbers joined from various data sources.

The problem is that the values use a dot "." as both the thousands separator and the decimal separator.

E.g. "100.000.000" and "9.000" use the dot as a thousands separator, while "0.600000" uses it as a decimal point; both forms need to end up as doubles.

I've tried a conditional regex, but I couldn't find the proper way, so I had to do it in two stages (a 'when' plus a regex extract):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import regexp_replace, regexp_extract, when

sc = SparkContext()
sqlContext = SQLContext(sc)

df3 = sqlContext.createDataFrame(
    [["100.000.000"], ["9.000"], ["0.600000"], ["12.000.000"], ["0.200000"], ["1.000"], ["700.000"]],
    ["SENSOR_VALUE"]
)

df3.show()

# Six digits after a dot mark the decimal part; three digits mark a thousands group.
regexDotDig6 = r"\.(\d{6})"
regexDotDig3 = r"\.(\d{3})"

df3 = df3.withColumn(
    "SENSOR_VALUE",
    when(
        # no six-digit decimal part present -> the dots are thousands separators
        regexp_extract(df3.SENSOR_VALUE, regexDotDig6, 1) == "",
        regexp_replace(df3.SENSOR_VALUE, regexDotDig3, "$1")
    ).otherwise(df3.SENSOR_VALUE).cast("double")
)

df3.show()

Is there a way to do it with a single regex, or in a more efficient way?
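
One idea I have sketched but not fully verified is a single regexp_replace with a lookahead: drop every dot that is followed by exactly three digits and then another dot or the end of the string. This assumes thousands groups are always exactly three digits and real decimals never have exactly three digits after the point (a value like "0.600" would be misread as a thousands group):

from pyspark.sql.functions import regexp_replace

# Drop a dot only when it precedes a three-digit group that is itself followed
# by another dot or the end of the string, i.e. a thousands separator.
df3.withColumn(
    "SENSOR_VALUE",
    regexp_replace("SENSOR_VALUE", r"\.(?=\d{3}(\.|$))", "").cast("double")
).show()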

Upvotes: 0

Views: 940

Answers (2)

pault

Reputation: 43504

You can use pyspark.sql.Column.startswith to check if the string starts with a "0". If it does, cast it to a double directly. If not, replace the periods with an empty string and then cast.

from pyspark.sql.functions import col, when, regexp_replace

df3.withColumn(
    "SENSOR_VALUE",
    when(
        col("SENSOR_VALUE").startswith("0"),
        col("SENSOR_VALUE")
    ).otherwise(regexp_replace("SENSOR_VALUE", r"\.", "")).cast("double")
).show()
#+------------+
#|SENSOR_VALUE|
#+------------+
#|       1.0E8|
#|      9000.0|
#|         0.6|
#|       1.2E7|
#|         0.2|
#|      1000.0|
#|    700000.0|
#+------------+
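
Side note: the 1.0E8-style output is just how show() renders large doubles; the stored values are exact. If you want plain digits for display, one option (a sketch; the decimal(18,6) precision and scale are arbitrary choices here) is to cast the converted column to a decimal type:

from pyspark.sql.functions import col, when, regexp_replace

converted = df3.withColumn(
    "SENSOR_VALUE",
    when(col("SENSOR_VALUE").startswith("0"), col("SENSOR_VALUE"))
    .otherwise(regexp_replace("SENSOR_VALUE", r"\.", ""))
    .cast("double")
)

# DecimalType prints plain digits instead of scientific notation
converted.select(col("SENSOR_VALUE").cast("decimal(18,6)").alias("SENSOR_VALUE")).show()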

Upvotes: 1

David Martin

Reputation: 47

This is not an optimal solution, but it is what I managed to come up with:

from pyspark.sql.functions import when, regexp_extract, regexp_replace

# Same patterns as in the question: six digits after the dot = decimal part,
# three digits after the dot = thousands group.
regexDotDig6 = r"\.(\d{6})"
regexDotDig3 = r"\.(\d{3})"

df3 = df3.withColumn(
    "SENSOR_VALUE",
    when(
        regexp_extract(df3.SENSOR_VALUE, regexDotDig6, 1) == "",
        regexp_replace(df3.SENSOR_VALUE, regexDotDig3, "$1")
    ).otherwise(df3.SENSOR_VALUE).cast("double")
)

df3.show()

+------------+
|SENSOR_VALUE|
+------------+
|       1.0E8|
|      9000.0|
|         0.6|
|       1.2E7|
|         0.2|
|      1000.0|
|    700000.0|
+------------+

Upvotes: 0
