seth127

Reputation: 2734

pyspark: Valid strings to pass to dataType arg of cast()

I posted this as a comment in this semi-related question but I felt it needed a post of its own.

Does anyone know where you can find a list of the valid strings to pass to the dataType argument of cast()? I've looked around and found things like this or this, but none of them explicitly answer the question.

Also, I've found through trial and error that you can pass strings like bigint or tinyint and they seem to work, even though they aren't listed anywhere as valid Spark data types, at least not anywhere I can find. Any ideas?

For some reproducibility:

df = spark.createDataFrame(
    [
        [18786, "attr1", 0.9743],
        [65747, "attr1", 0.4568],
        [56465, "attr1", 0.6289],
        [18786, "attr2", 0.2976],
        [65747, "attr2", 0.4869],
        [56465, "attr2", 0.8464],
    ],
    ["id", "attr", "val"],
)
print(df)

This gives you DataFrame[id: bigint, attr: string, val: double], presumably by inferring the schema by default.

Then you can do something like this to re-cast the types:

from pyspark.sql.functions import col

fielddef = {'id': 'smallint', 'attr': 'string', 'val': 'long'}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)

And now I get DataFrame[id: smallint, attr: string, val: bigint] so apparently 'long' converts to 'bigint'. I'm sure there are other conversions like that.

Also, I had a hunch that it would just silently ignore invalid strings you pass it, but that's not true. When I tried passing 'attr': 'varchar' in the fielddef dict, I got a DataType varchar is not supported... error.

Any help is much appreciated!

Upvotes: 1

Views: 9464

Answers (1)

Charlie Flowers

Reputation: 1380

This is kind of tricky to answer definitively since Spark supports complex types (Maps, Arrays, Structs) of arbitrary complexity, as well as user-defined types. For practical purposes, DataTypeParserSuite.scala has a pretty comprehensive set of examples for primitive and complex types.

For primitive types, I've adapted this list from the visitPrimitiveDataType method of AstBuilder.scala:

  • "boolean" -> BooleanType
  • "tinyint" | "byte" -> ByteType
  • "smallint" | "short" -> ShortType
  • "int" | "integer" -> IntegerType
  • "bigint" | "long" -> LongType
  • "float" -> FloatType
  • "double" -> DoubleType
  • "date" -> DateType
  • "timestamp" -> TimestampType
  • "string" | "char(x)" | "varchar(x)" -> StringType
  • "binary" -> BinaryType
  • "decimal" | "decimal(x)" | "decimal(x.y)" -> DecimalType

Complex types are then combinations of themselves and primitive types, e.g. struct<col1 : timestamp, col2 : bigint, col3 : map<string,array<double>>>

Upvotes: 1
