Reputation: 764
Is there any way to use a custom record delimiter when reading a CSV file in PySpark? In my file, records are separated by ** instead of newlines. Is there a way to use this custom line/record separator when reading the CSV into a PySpark DataFrame? Also, my column separator is ';'. The code below picks up the columns correctly, but it counts everything as a single row:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('temp').getOrCreate()
df = (spark.read.format('csv').option("header", "false")
      .option("delimiter", ';').option("inferSchema", "true")
      .load("some-file-on-s3"))
Upvotes: 0
Views: 1342
Reputation: 1228
I would read it as a plain text file into an RDD, split on the characters that form your record delimiter, and then convert the result to a DataFrame, like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd1 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))  # one element per "**"-delimited record
        .map(lambda x: x.split(";")))            # split each record into columns on ";"
df1 = rdd1.toDF(["a", "b", "c"])
df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+
Or, if you prefer to split the columns with DataFrame functions:
from pyspark.sql import functions as f

rdd2 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))  # one element per record
        .map(lambda x: [x]))                     # wrap in a list so toDF sees one column
df2 = (rdd2
       .toDF(["abc"])
       .withColumn("a", f.split(f.col("abc"), ";")[0])
       .withColumn("b", f.split(f.col("abc"), ";")[1])
       .withColumn("c", f.split(f.col("abc"), ";")[2])
       .drop("abc"))
df2.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+
where test.txt looks like this:
a1;b1;c1**a2;b2;c2**a3;b2;c3
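Note that textFile still splits the input on newlines first, so this works as long as no record contains an embedded newline. If that can happen, you can instead tell the Hadoop input format itself to split records on **. A minimal sketch of that approach, assuming the same test.txt and the sc from above:

# Sketch: have the Hadoop input format split records on "**" instead of newlines
rdd3 = (sc.newAPIHadoopFile(
            "/jupyter/nfs/test.txt",
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": "**"})
        .map(lambda kv: kv[1])           # keep the record text, drop the byte-offset key
        .map(lambda x: x.split(";")))    # split each record into columns on ";"
df3 = rdd3.toDF(["a", "b", "c"])

This keeps each **-delimited record intact even when the file spans multiple physical lines.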
Upvotes: 1