Nisman

Reputation: 1309

How to convert a number with a trailing dash to a negative number in PySpark?

I am reading CSV data with Spark, specifying a schema and using FAILFAST mode. The data contains several column types, including integers. The problem is that some integers have a trailing dash instead of a leading one (324- instead of -324), so Spark treats them as strings. Currently the read fails when it parses these values as integers, and if I remove FAILFAST mode from my code, Spark replaces every non-integer value with null:

# The parentheses let the method chain span several lines in Python.
df = (spark.read.format("com.databricks.spark.csv")
               .option("sep", "\t")
               .option("header", header)
               .option("mode", "FAILFAST")
               .schema(schema)
               .load(path))

Is there a quick and easy way to tell Spark to load those integers as negative numbers while keeping FAILFAST mode?

Upvotes: 1

Views: 1448

Answers (1)

chlebek

Reputation: 2451

You can load these columns as strings and then convert them to integers. For example, in Scala:

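import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// A "-" found after position 1 is a trailing sign: strip it and prepend "-".
def castInt(c: Column): Column = when(instr(c, "-") > 1, concat(lit("-"), trim(c, "-")))
                                   .otherwise(c).cast("INT")

df.select(castInt(col("column1")))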
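Since the question is about PySpark, here is a rough Python sketch of the same idea. It assumes the affected column is declared as a string in the schema and that the values are plain digits with an optional trailing dash. The names move_trailing_minus and cast_trailing_minus are made up for this example. The first function only shows the rewrite rule in plain Python; the second builds the Spark column expression with regexp_replace instead of the when/instr/trim combination above.

```python
import re

# Digits followed by a single trailing dash, e.g. "324-".
_TRAILING_MINUS = re.compile(r"^(\d+)-$")


def move_trailing_minus(text):
    """Plain-Python illustration of the rule: "324-" becomes "-324".

    Values that do not match the pattern are returned unchanged.
    """
    return _TRAILING_MINUS.sub(r"-\1", text)


def cast_trailing_minus(column_name):
    """Build a Spark column that applies the same rule, then casts to int.

    Hypothetical helper; assumes the column was read as a string.
    """
    # Imported here so the plain-Python helper above can be tried without Spark.
    from pyspark.sql import functions as F

    # Spark uses Java regex syntax, so the captured group is written as $1.
    fixed = F.regexp_replace(F.col(column_name), r"^(\d+)-$", "-$1")
    return fixed.cast("int").alias(column_name)
```

With a DataFrame read as in the question, this would be used as df.select(cast_trailing_minus("column1")). Keep in mind that FAILFAST only applies to the CSV parsing step; whether the later cast fails or returns null for a value that is still not numeric depends on the Spark version and its ANSI setting, so it may be worth checking those columns for unexpected nulls.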

Upvotes: 2
