Reputation: 2981
I am creating an RDD by loading the data from a text file in PySpark. Now I want to convert this RDD into a DataFrame, but I do not know how many columns are present in the RDD or what they are. I am trying to use createDataFrame(), and the syntax shown for it is sqlDataFrame = sqlContext.createDataFrame(rdd, schema). I tried to see how to create the schema, but most of the examples show a hardcoded schema. Since I do not know what the columns are, how can I convert the RDD into a DataFrame? Here is my code so far:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
example_rdd = sc.textFile("\..\file1.csv") \
                .map(lambda line: line.split(","))

# convert the RDD into a DataFrame
# df = sqlContext.createDataFrame(example_rdd, schema)  # dataframe conversion here
NOTE 1: The reason I do not know the columns is that I am trying to create a general script that can create a DataFrame from an RDD read from any file, with any number of columns.
NOTE 2: I know there is another function called toDF() that can convert an RDD to a DataFrame, but with that too I have the same issue of how to pass the unknown columns.
NOTE 3: The file format is not just CSV. I have shown a CSV as an example, but it can be a file of any format.
Upvotes: 2
Views: 1782
Reputation: 4044
Spark 2.0.0 onwards supports reading a CSV file directly as a DataFrame. To read a CSV, use the DataFrameReader.csv method:
df = spark.read.csv("\..\file1.csv", header=True)
In your case, if you do not have access to the spark object, you can use:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.csv("\..\file1.csv", header=True)
In case the file has a different separator, you can specify that too:
# E.g. if the separator is ::
df = spark.read.csv("\..\file1.csv", header=True, sep="::")
Upvotes: 4