Reputation: 1564
A toy example works fine when its schema is defined statically. The dynamically defined schema throws an error, but why, and how do I fix it? They seem identical.
Statically defined:
XXX = sc.parallelize([('kygiacomo', 0, 1), ('namohysip', 1, 0)])
schema = StructType([
    StructField("username", StringType(), True),
    StructField("FanFiction", IntegerType(), True),
    StructField("nfl", IntegerType(), True)])
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show()
Output which is good:
StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true)))
+---------+----------+---+
| username|FanFiction|nfl|
+---------+----------+---+
|kygiacomo| 0| 1|
|namohysip| 1| 0|
+---------+----------+---+
Dynamically-defined:
print(XXX.collect())
username_field = [StructField('username', StringType(), True)]
int_fields = [StructField(str(i), IntegerType(), True) for i in itemids.keys()]
schema = StructType(username_field + int_fields)
print(schema)
df = sess.createDataFrame(XXX, schema)
df.show()
Output, which throws an error on df.show():
[('kygiacomo', 0, 1, 0, 0, 0, 0), ('namohysip', 1, 0, 0, 0, 0, 0), ('immortalis', 0, 1, 0, 0, 0, 0), ('403and780', 0, 0, 0, 0, 0, 1), ('SDsc0rch', 0, 0, 0, 1, 0, 0), ('shitpostlord4321', 0, 0, 0, 0, 1, 0), ('scarletcrawford', 0, 0, 1, 0, 0, 0)]
StructType(List(StructField(username,StringType,true),StructField(FanFiction,IntegerType,true),StructField(nfl,IntegerType,true),StructField(alteredcarbon,IntegerType,true),StructField(The_Donald,IntegerType,true),StructField(marvelstudios,IntegerType,true),StructField(hockey,IntegerType,true)))
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
...
TypeError: field FanFiction: IntegerType can not accept object 0 in type <class 'numpy.int64'>
I cannot see what the code is doing differently. Can you? Thanks.
Upvotes: 0
Views: 1606
Based on the exception:
TypeError: field FanFiction: IntegerType can not accept object 0 in type <class 'numpy.int64'>
your data contains objects of type numpy.int64. The Spark SQL API in general doesn't support NumPy types (see SPARK-12157, SPARK-6857 and other JIRA tickets).
The problem is not reproducible with the examples you have provided, but it is consistent with your previous questions (How to create a tuple from list or array without code generation in pyspark, How to convert numpy array elements to spark RDD column values), which clearly show you extract the data from a pyspark.mllib.linalg.distributed.IndexedRowMatrix / pyspark.mllib.linalg.Vector. By default, all values extracted from Vectors are represented using NumPy types, hence the error.
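The cause can be seen without Spark at all: in Python 3, NumPy integer scalars are not subclasses of Python's int, which is what the IntegerType verifier checks for. A minimal sketch (numpy assumed installed):

```python
import numpy as np

value = np.int64(0)

# Spark's IntegerType verifier only accepts plain Python ints;
# numpy integer scalars are not subclasses of int in Python 3,
# so the isinstance check fails and Spark raises the TypeError above.
print(isinstance(value, int))  # False
```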
Now, the answer to your previous question already shows one possible solution: convert the data to standard Python types using tolist(). Alternatively, convert each entry directly by calling the corresponding builtin (int, float) on each record in the row.
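A minimal sketch of that conversion, assuming rows shaped like the data in the question (the row literal below is illustrative, not taken from the original code):

```python
import numpy as np

# Illustrative row: a username followed by numpy integer counts,
# as produced when values are pulled out of an mllib Vector
row = ('kygiacomo', np.int64(0), np.int64(1))

# Cast every numpy integer to a plain Python int so IntegerType accepts it
clean = tuple(int(x) if isinstance(x, np.integer) else x for x in row)

print(clean)  # ('kygiacomo', 0, 1), now with plain Python ints
```

Applied to the RDD in the question, this would look something like XXX.map(lambda r: tuple(int(x) if isinstance(x, np.integer) else x for x in r)) before calling createDataFrame.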
Upvotes: 1