cph_sto

Reputation: 7587

PySpark: How to specify column with comma as decimal

I am working with PySpark and loading a .csv file. I have a column with numbers in European format, i.e. the comma is the decimal separator and the dot is the thousands separator.

For example: I have 2.416,67 instead of 2,416.67.

My data in the .csv file looks like this:

ID;   Revenue
21;    2.645,45
23;   31.147,05
...
55;    1.009,11

In pandas, such a file can easily be read by passing the decimal=',' and thousands='.' options to pd.read_csv().

Pandas code:

import pandas as pd
df = pd.read_csv("filepath/revenues.csv", sep=';', decimal=',', thousands='.')

I don't know how this can be done in PySpark.

PySpark code:

from pyspark.sql.types import StructType, StructField, FloatType, StringType

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Revenue", FloatType(), True)
])

df = spark.read.csv("filepath/revenues.csv", sep=';', encoding='UTF-8', schema=schema, header=True)

Can anyone suggest how such a file can be loaded in PySpark using the .csv() reader shown above?

Upvotes: 16

Views: 30545

Answers (3)

Sander Vanden Hautte

Reputation: 2543

If your dataset has lots of float columns but is still small enough to preprocess with pandas first, I found it easier to just do the following.

import pandas as pd

# Let pandas handle the European number format, then write the file back out
# with '.' as the decimal separator so Spark can infer the types.
df_pandas = pd.read_csv('yourfile.csv', sep=';', decimal=',', thousands='.')
df_pandas.to_csv('yourfile__dot_as_decimal_separator.csv', sep=';', decimal='.', index=False)  # optionally also header=True of course.

df_spark = spark.read.csv('yourfile__dot_as_decimal_separator.csv', sep=';', inferSchema=True)  # optionally also header=True of course.

I did find jhole89's answer very useful, but found it a pain to apply to a dataset with a lot of columns (several hundred).

I mean:

  • manually specifying the float columns and converting them is a lot of effort, and
  • trying to find them dynamically (checking which columns are string-typed and contain a comma, making sure that datetime columns with millisecond separators are not picked up, dealing with casts that fail because some columns contain commas but are plain text rather than numbers, etc.) causes headaches (a sketch of what that dynamic approach might look like follows below).

Therefore, if there are multiple float columns and your dataset can be preprocessed with pandas, you can apply the above code.
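
For illustration, a minimal, untested sketch of what that dynamic approach might look like, assuming a plain header=True read and a regex heuristic for European-formatted numbers (both assumptions of mine). It also shows why this gets fragile: the check runs one Spark job per string column and can still misclassify text columns.

from pyspark.sql import functions as F

df_spark = spark.read.csv('yourfile.csv', sep=';', header=True)  # every column comes in as string

# Heuristic (an assumption): a string column is "European numeric" if all of its
# non-null values look like digit groups with optional '.' thousands separators
# and an optional ',' decimal part, e.g. 2.645,45 or 31.147,05.
european_number = r'^\d{1,3}(\.\d{3})*(,\d+)?$'

for name, dtype in df_spark.dtypes:
    if dtype != 'string':
        continue
    # One count() -- i.e. one Spark job -- per candidate column: this is part of the pain.
    mismatches = df_spark.filter(F.col(name).isNotNull() & ~F.col(name).rlike(european_number)).count()
    if mismatches == 0:
        df_spark = df_spark.withColumn(
            name,
            F.regexp_replace(F.regexp_replace(F.col(name), '\\.', ''), ',', '.').cast('float')
        )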

Upvotes: 1

braga461

Reputation: 41

Make sure your SQL table is defined to accept NUMERIC values instead of INTEGER. I had a lot of trouble trying to figure out everything about encoding and the different formats of dots and commas, and in the end the problem was much more primitive: the table was set up to accept only INTEGER numbers, so no decimals would ever be accepted, whether written with commas or with dots. Changing the SQL table to accept real numbers (NUMERIC) solved it.
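
As a rough illustration only (the connection URL, table name, credentials and precision are all placeholders, and df stands for the Spark DataFrame being written), this is what forcing a NUMERIC column type can look like when the table is created by a Spark JDBC write; alternatively, alter the existing table's column type directly on the database side.

# Hypothetical JDBC write: url, dbtable and credentials are placeholders.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://localhost:5432/mydb")
   .option("dbtable", "revenues")
   .option("user", "me")
   .option("password", "secret")
   # Create Revenue as NUMERIC so decimal values are accepted, instead of an integer default.
   .option("createTableColumnTypes", "Revenue NUMERIC(12,2)")
   .mode("overwrite")
   .save())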

Upvotes: -1

jhole89

Reputation: 828

You won't be able to read it as a float because of the format of the data. You need to read it as a string, clean it up, and then cast it to float:

from pyspark.sql.functions import regexp_replace

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))  # strip the '.' thousands separators
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))   # turn the ',' decimal separator into '.'
df = df.withColumn('revenue', df['revenue'].cast("float"))           # now the string can be cast to float

You can probably just chain these all together too:

df = spark.read.option("headers", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
         df
         .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
         .withColumn('revenue', regexp_replace('revenue', ',', '.'))
         .withColumn('revenue', df['revenue'].cast("float"))
     )

Please note that I haven't tested this, so there may be a typo or two in there.
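
If more than one column needs this treatment, the same three steps can be put in a loop (again untested; the list of column names is an assumption):

from pyspark.sql.functions import col, regexp_replace

european_cols = ['revenue']  # add any other European-formatted columns here

for c in european_cols:
    df = df.withColumn(c, regexp_replace(c, '\\.', ''))  # strip '.' thousands separators
    df = df.withColumn(c, regexp_replace(c, ',', '.'))   # ',' decimal separator -> '.'
    df = df.withColumn(c, col(c).cast('float'))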

Upvotes: 18
