Kashif

Reputation: 3327

How do I convert a numpy array to a pyspark dataframe?

I want to convert my results1 numpy array to a dataframe. For the record, results1 looks like

array([(1.0, 0.1738578587770462), (1.0, 0.33307021689414978),
       (1.0, 0.21377330869436264), (1.0, 0.443511435389518738),
       (1.0, 0.3278091162443161), (1.0, 0.041347454154491425)]).

I want to convert the above into a PySpark DataFrame with columns labeled "limit" (the first value in each tuple) and "probability" (the second value in each tuple).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('YKP').getOrCreate()
sc = spark.sparkContext
# Convert list to RDD
rdd = sc.parallelize(results1)

# Create data frame
df = sc.createDataFrame(rdd)

I keep getting the error

AttributeError: 'RemoteContext' object has no attribute 'createDataFrame'

when I run this. I don't see why this raises an error. How do I fix it?

Upvotes: 0

Views: 4064

Answers (2)

F4RZ4D

Reputation: 125

The simplest way is:

# wrap each value in a 1-tuple so toDF() can infer a single-column schema
df = rdd.map(lambda x: (x,)).toDF()
df.show()

You can also refer to this post for more details: Create Spark DataFrame. Can not infer schema for type: <type 'float'>
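For completeness: the AttributeError in the question comes from calling createDataFrame on the SparkContext, but it is a method of SparkSession. A minimal sketch of that direct fix, assuming the spark session and results1 from the question (the float() casts avoid the schema-inference error the linked post describes):

# call createDataFrame on the SparkSession, not the SparkContext
df = spark.createDataFrame(
    [(float(a), float(b)) for a, b in results1],  # plain Python floats, not numpy.float64
    ["limit", "probability"],
)
df.show()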

Upvotes: 0

Cena

Reputation: 3419

Use map() and toDF() instead.

import numpy as np

results1 = np.array([(1.0, 0.1738578587770462), (1.0, 0.33307021689414978),
       (1.0, 0.21377330869436264), (1.0, 0.443511435389518738),
       (1.0, 0.3278091162443161), (1.0, 0.041347454154491425)])

# cast each numpy.float64 to a plain Python float so Spark can infer the schema
df = sc.parallelize(results1).map(lambda x: [float(i) for i in x])\
        .toDF(["limit", "probability"])

df.show()
+-----+--------------------+                                                    
|limit|         probability|
+-----+--------------------+
|  1.0|  0.1738578587770462|
|  1.0|  0.3330702168941498|
|  1.0| 0.21377330869436265|
|  1.0| 0.44351143538951876|
|  1.0|  0.3278091162443161|
|  1.0|0.041347454154491425|
+-----+--------------------+
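
The float(i) cast in the map() is what makes this work: Spark's schema inference has historically rejected numpy.float64 values (the same family of error as the post linked in the first answer), so each row is converted to plain Python floats first. A sketch of an equivalent shortcut, assuming the same spark session: numpy's tolist() performs that conversion in one step.

# tolist() turns numpy scalars into plain Python floats, row by row
df = spark.createDataFrame(results1.tolist(), ["limit", "probability"])
df.show()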

Upvotes: 1
