Bo Qiang

Reputation: 789

how to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:

from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()

But I got an error:

createDataFrame() missing 1 required positional argument: 'data'

I don't understand why this happens because I already supplied 'data', which is the variable rows.

Thanks

Upvotes: 3

Views: 25014

Answers (3)

Naveen Nelamali

Reputation: 1164

Below are the steps to create a PySpark DataFrame using createDataFrame:

Create a SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

Create data and columns

columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

First approach: creating the DataFrame from an RDD

rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)

Second approach: creating the DataFrame directly from the data

df2 = spark.createDataFrame(data).toDF(*columns)

Upvotes: 2

Suresh

Reputation: 5870

You have to create a SparkSession instance using the builder pattern and use it to create the DataFrame; see https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession

spark = SparkSession.builder.getOrCreate()

Upvotes: 4

lizardfireman

Reputation: 369

Try rows = [(1,), (2,), (3,)]? If I am not wrong, createDataFrame() takes two lists as input: the first is the data and the second is the column names. The data must be a list of tuples, where each tuple is one row of the DataFrame.

Upvotes: 0
