user2966197

Reputation: 2981

How to convert a PySpark RDD to a Dataframe with unknown columns?

I am creating an RDD by loading data from a text file in PySpark. Now I want to convert this RDD into a dataframe, but I do not know how many columns it has or what they are. I am trying to use createDataFrame(), whose syntax is sqlDataFrame = sqlContext.createDataFrame(rdd, schema). I looked at how to create the schema, but most examples show a hardcoded schema. Since I do not know the columns, how can I convert the RDD into a dataframe? Here is my code so far:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

example_rdd = sc.textFile(r"\..\file1.csv") \
                .map(lambda line: line.split(","))

# convert the RDD into a dataframe
# df = sqlContext.createDataFrame(example_rdd, schema)  # dataframe conversion here.

NOTE 1: The reason I do not know the columns is that I am trying to create a general script that can create a dataframe from an RDD read from any file with any number of columns.

NOTE 2: I know there is another function called toDF() that can convert an RDD to a dataframe, but with that too I have the same issue: how do I pass the unknown columns?

NOTE 3: The file format is not necessarily CSV. I have shown it as an example, but it can be any file in any format.

Upvotes: 2

Views: 1782

Answers (1)

Spandan Brahmbhatt

Reputation: 4044

Spark 2.0.0 onwards supports reading a CSV file directly as a DataFrame. To read a CSV file, use the DataFrameReader.csv method:

df = spark.read.csv(r"\..\file1.csv", header=True)

In your case, if you do not have access to the spark object, you can use:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.csv(r"\..\file1.csv", header=True)

If the file uses a different separator, you can specify that too with the sep option.

# E.g. if the separator is ::
df = spark.read.csv(r"\..\file1.csv", header=True, sep="::")

Upvotes: 4

Related Questions