Reputation: 2734
I posted this as a comment in this semi-related question but I felt it needed a post of its own.
Does anyone know where you can find a list of the valid strings to pass to the dataType argument of cast()? I've looked, and I find things like this or this, but none of them explicitly answers the question.
Also, I've found through trial and error that you can pass things like bigint or tinyint and those seem to work, though they're nowhere listed as valid Spark data types, at least not anywhere I can find. Any ideas?
For some reproducibility:
df = spark.createDataFrame(
    [
        [18786, "attr1", 0.9743],
        [65747, "attr1", 0.4568],
        [56465, "attr1", 0.6289],
        [18786, "attr2", 0.2976],
        [65747, "attr2", 0.4869],
        [56465, "attr2", 0.8464],
    ],
    ["id", "attr", "val"],
)
print(df)
This gives you DataFrame[id: bigint, attr: string, val: double], I guess by inferring the schema by default.
Then you can do something like this to re-cast the types:
from pyspark.sql.functions import col
fielddef = {'id': 'smallint', 'attr': 'string', 'val': 'long'}
df = df.select([col(c).cast(fielddef[c]) for c in df.columns])
print(df)
And now I get DataFrame[id: smallint, attr: string, val: bigint], so apparently 'long' converts to 'bigint'. I'm sure there are other conversions like that.
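For what it's worth, here is a minimal sketch of the other alias pairs I believe Spark's type parser accepts (this dict is my own illustration, not an official PySpark API — verify against your Spark version):

```python
# Hedged sketch: alias pairs I believe Spark's type parser accepts,
# mapping the alias to the canonical name cast() reports back.
# This dict is an illustration, not part of PySpark itself.
TYPE_ALIASES = {
    "long": "bigint",      # observed above: 'long' comes back as bigint
    "short": "smallint",
    "integer": "int",
    "byte": "tinyint",
}

for alias, canonical in TYPE_ALIASES.items():
    print(f"cast('{alias}') should report as {canonical}")
```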
Also, I had this weird feeling that it would just silently ignore invalid strings you pass it, but this is not true. When I tried passing 'attr': 'varchar' in the fielddef dict, I got a DataType varchar is not supported... error.
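Since an unsupported name raises rather than being silently ignored, one option is to validate a fielddef up front before building the select. A minimal sketch, assuming a hand-maintained whitelist (the names below are ones I believe Spark accepts, not an exhaustive or official list):

```python
# Names I believe Spark's type parser accepts; adjust for your Spark version.
# This whitelist is a hand-maintained assumption, not pulled from PySpark.
KNOWN_TYPES = {
    "boolean", "tinyint", "byte", "smallint", "short", "int", "integer",
    "bigint", "long", "float", "double", "string", "binary", "date",
    "timestamp", "decimal",
}

def check_fielddef(fielddef):
    """Raise early if any column maps to a type string not in the whitelist."""
    unknown = {c: t for c, t in fielddef.items() if t not in KNOWN_TYPES}
    if unknown:
        raise ValueError(f"unsupported type strings: {unknown}")
    return fielddef

check_fielddef({"id": "smallint", "attr": "string", "val": "long"})  # passes
```

With this in place, check_fielddef({'attr': 'varchar'}) fails fast with a plain ValueError instead of an error from deep inside Spark.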
Any help is much appreciated!
Upvotes: 1
Views: 9464
Reputation: 1380
This is kind of tricky to answer definitively, since Spark supports complex types (maps, arrays, structs) of arbitrary nesting depth, as well as user-defined types. For practical purposes, DataTypeParserSuite.scala has a pretty comprehensive set of examples for primitive and complex types.
For primitive types, I've adapted this list from the visitPrimitiveDataType method of AstBuilder.scala.
Complex types are then combinations of themselves and primitive types, e.g. struct<col1 : timestamp, col2 : bigint, col3 : map<string,array<double>>>
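To make the composition concrete: since complex type strings nest recursively, you can build them mechanically. A small sketch with hypothetical helper functions (my own, not part of PySpark) that emit the same syntax:

```python
# Hypothetical helpers (not a PySpark API): compose Spark SQL type strings.
def array_of(element: str) -> str:
    return f"array<{element}>"

def map_of(key: str, value: str) -> str:
    return f"map<{key},{value}>"

def struct_of(**fields: str) -> str:
    body = ",".join(f"{name}:{t}" for name, t in fields.items())
    return f"struct<{body}>"

t = struct_of(col1="timestamp", col2="bigint",
              col3=map_of("string", array_of("double")))
print(t)  # struct<col1:timestamp,col2:bigint,col3:map<string,array<double>>>
```

The resulting string can then be passed straight to cast(), the same as any primitive type name.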
Upvotes: 1