Reputation: 1114
I have a pandas DataFrame consisting of one column of integers and another column of numpy arrays
import numpy as np
import pandas as pd

pd.DataFrame({'col_1': [1434, 3046, 3249, 3258],
              'col_2': [np.array([1434, 1451, 1467]),
                        np.array([3046, 3304]),
                        np.array([3249, 3246, 3298, 3299, 3220]),
                        np.array([3258, 3263, 3307])]})
col_1 col_2
0 1434 [1434, 1451, 1467]
1 3046 [3046, 3304]
2 3249 [3249, 3246, 3298, 3299, 3220]
3 3258 [3258, 3263, 3307]
that I want to convert to a Spark DataFrame in the following format
from pyspark.sql.functions import col, explode

df = sc.parallelize([[1434, [1434, 1451, 1467]],
                     [3046, [3046, 3304]],
                     [3249, [3249, 3246, 3298, 3299, 3220]],
                     [3258, [3258, 3263, 3307]]]).toDF(['col_1', 'col_2'])
df.select('col_1', explode(col('col_2')).alias('col_2')).show(14)
+-----+-----+
|col_1|col_2|
+-----+-----+
| 1434| 1434|
| 1434| 1451|
| 1434| 1467|
| 3046| 3046|
| 3046| 3304|
| 3249| 3249|
| 3249| 3246|
| 3249| 3298|
| 3249| 3299|
| 3249| 3220|
| 3258| 3258|
| 3258| 3263|
| 3258| 3307|
+-----+-----+
If I attempt to convert the pandas DataFrame directly to a Spark DataFrame, I get the error
not supported type: <type 'numpy.ndarray'>
Upvotes: 3
Views: 5799
Reputation: 5389
I guess one way is to convert each array in col_2 to a list of plain Python integers, since Spark cannot serialize numpy types.
df.col_2 = df.col_2.map(lambda x: [int(e) for e in x])
Then, convert it to a Spark DataFrame directly
df_spark = spark.createDataFrame(df)
df_spark.select('col_1', explode(col('col_2')).alias('col_2')).show(14)
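An alternative to the explicit `int(e)` loop is `ndarray.tolist()`, which converts both the array and its numpy scalar elements to plain Python objects in one call. A sketch of just the pandas-side conversion (calling `spark.createDataFrame` on the result then works as above):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    'col_1': [1434, 3046, 3249, 3258],
    'col_2': [np.array([1434, 1451, 1467]),
              np.array([3046, 3304]),
              np.array([3249, 3246, 3298, 3299, 3220]),
              np.array([3258, 3263, 3307])],
})

# ndarray.tolist() returns a Python list whose elements are
# plain Python ints, not numpy.int64 scalars
pdf['col_2'] = pdf['col_2'].apply(lambda a: a.tolist())

assert all(isinstance(v, list) for v in pdf['col_2'])
assert all(isinstance(e, int) for v in pdf['col_2'] for e in v)
```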
Upvotes: 6