LUZO

Reputation: 1029

convert rdd to dataframe without schema in pyspark

I'm trying to convert an RDD to a DataFrame without specifying any schema. I tried the code below. It runs fine, but the DataFrame columns come out in a shuffled order.

from pyspark.sql import Row

def f(x):
    # Map each column index (as a string) to its value.
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d

rdd = sc.textFile("test")
df = rdd.map(lambda x: x.split(",")).map(lambda x: Row(**f(x))).toDF()
df.show()
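A likely cause of the shuffling: on Spark versions before 3.0, Row(**kwargs) sorts the field names alphabetically, so string keys like "0", "1", ..., "10" come back in lexicographic rather than numeric order. A minimal sketch of this, using a hypothetical 11-column row:

from pyspark.sql import Row

# On Spark < 3.0, Row(**kwargs) sorts field names alphabetically,
# so "10" sorts before "2" and the columns appear shuffled.
r = Row(**{str(i): i for i in range(11)})
print(r.__fields__)  # e.g. ['0', '1', '10', '2', '3', ..., '9']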

Upvotes: 2

Views: 8443

Answers (1)

Shaido

Reputation: 28322

If you don't want to specify a schema, do not convert the RDD to Row objects. If you simply have a normal RDD (not an RDD[Row]) you can use toDF() directly.

df = rdd.map(lambda x: x.split(",")).toDF()
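With no names given, the columns default to _1, _2, and so on. A small sketch of what to expect, assuming a hypothetical two-line file test containing a,1 and b,2:

df = sc.textFile("test").map(lambda x: x.split(",")).toDF()
df.show()
# +---+---+
# | _1| _2|
# +---+---+
# |  a|  1|
# |  b|  2|
# +---+---+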

You can give names to the columns using toDF() as well. Note that on an RDD (unlike on a DataFrame) toDF() takes the names as a list:

df = rdd.map(lambda x: x.split(",")).toDF(["col1_name", ..., "colN_name"])

If what you have is an RDD[Row], you need to actually know the type of each column. This can be done by specifying a schema, or (in Scala) by pattern matching on each Row as follows:

val df = rdd.map({ 
  case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
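A rough PySpark equivalent, as a sketch (row_rdd and the two-column shape are hypothetical), is to pull the values back out of each Row and rebuild the DataFrame with names:

# Hypothetical: row_rdd is an RDD[Row] with two columns.
df = row_rdd.map(lambda row: (row[0], row[1])).toDF(["col1_name", "col2_name"])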

Upvotes: 4
