user2966197

Reputation: 2981

How to convert a PySpark RDD to a Dataframe with unknown columns?

I am creating an RDD by loading data from a text file in PySpark. Now I want to convert this RDD into a dataframe, but I do not know how many columns it has or what they are. I am trying to use createDataFrame(), whose syntax is sqlDataFrame = sqlContext.createDataFrame(rdd, schema). I looked at how to create the schema, but most examples show a hardcoded schema. Since I do not know the columns, how can I convert the RDD into a dataframe? Here is my code so far:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

example_rdd = sc.textFile(r"\..\file1.csv") \
                .map(lambda line: line.split(","))

# convert the RDD into a dataframe
# df = sqlContext.createDataFrame(example_rdd, schema)  # dataframe conversion here.

NOTE 1: The reason I do not know the columns is that I am trying to create a general script that can create a dataframe from an RDD read from any file with any number of columns.

NOTE 2: I know there is another function called toDF() that can convert an RDD to a dataframe, but with that too I have the same issue: how do I pass the unknown columns?

NOTE 3: The file format is not necessarily CSV. I have shown it as an example, but it can be any file in any format.

Upvotes: 2

Views: 1782

Answers (1)

Spandan Brahmbhatt

Reputation: 4044

Spark 2.0.0 onwards supports reading a CSV file directly as a DataFrame. To read a CSV file, use the DataFrameReader.csv method:

df = spark.read.csv(r"\..\file1.csv", header=True)

In your case, if you do not have access to the spark object, you can use:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.csv(r"\..\file1.csv", header=True)

If the file uses a different separator, you can specify that too with the sep option.

# E.g. if the separator is ::
df = spark.read.csv(r"\..\file1.csv", header=True, sep="::")

Upvotes: 4

Related Questions