Geoffrey Anderson

Reputation: 1564

Dynamic DataFrame schema construction in Apache PySpark v2.3.0

A toy example with a statically defined schema works fine. The dynamically defined schema throws an error, even though the two schemas look identical. Why does this happen, and how can I fix it?

Statically defined:

XXX = sc.parallelize([('kygiacomo', 0, 1), ('namohysip', 1, 0)])
schema = StructType([
    StructField("username",StringType(),True),
    StructField("FanFiction",IntegerType(),True),
    StructField("nfl",IntegerType(),True)])
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show() 

Output which is good:

StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true)))
+---------+----------+---+
| username|FanFiction|nfl|
+---------+----------+---+
|kygiacomo|         0|  1|
|namohysip|         1|  0|
+---------+----------+---+

Dynamically-defined:

print(XXX.collect())
username_field = [StructField('username', StringType(), True)]
int_fields = [StructField(str(i), IntegerType(), True) for i in itemids.keys()]
schema = StructType(username_field + int_fields)
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show()

Output which throws an error on df.show:

[('kygiacomo', 0, 1, 0, 0, 0, 0), ('namohysip', 1, 0, 0, 0, 0, 0), ('immortalis', 0, 1, 0, 0, 0, 0), ('403and780', 0, 0, 0, 0, 0, 1), ('SDsc0rch', 0, 0, 0, 1, 0, 0), ('shitpostlord4321', 0, 0, 0, 0, 1, 0), ('scarletcrawford', 0, 0, 1, 0, 0, 0)]
StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true),StructField(alteredcarbon,IntegerType,true),StructField(The_Donald,IntegerType,true),StructField(marvelstudios,IntegerType,true),StructField(hockey,IntegerType,true)))

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
...
TypeError: field FanFiction: IntegerType can not accept object 0 in type <class 'numpy.int64'>

I cannot see what the code is doing differently. Can you? Thanks.

Upvotes: 0

Views: 1606

Answers (1)

user9947746

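The schemas are not the problem; the data is. The error message says the values in the dynamic case are numpy.int64, which IntegerType does not accept here, whereas the toy example uses plain Python ints. The answer to your previous question already shows one possible solution: convert the data to standard Python types using tolist.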

Alternatively, convert each entry directly by calling the corresponding built-in function (int, float) on each value in the row.
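
A minimal sketch of the per-value conversion, assuming the records are tuples like the ones printed in the question. The helper names to_builtin and to_builtin_row are made up for illustration. The sketch relies on the fact that NumPy scalars expose an .item() method that returns the equivalent built-in Python value; other values (such as the username string) pass through unchanged.

```python
def to_builtin(value):
    """Return a plain Python scalar for NumPy-style scalars; pass other values through."""
    # NumPy scalars (numpy.int64, numpy.float64, ...) expose .item(), which
    # returns the equivalent built-in int/float/bool. Plain str and int
    # objects have no .item attribute, so they are returned as-is.
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def to_builtin_row(row):
    """Convert every field of one record to built-in Python types."""
    return tuple(to_builtin(v) for v in row)


# Hypothetical usage with the names from the question (needs a live Spark session):
# df = sess.createDataFrame(XXX.map(to_builtin_row), schema)
# df.show()
```

Mapping the conversion over the RDD before calling createDataFrame keeps the dynamically built schema untouched; only the row values change type.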

Upvotes: 1
