Infinite

Reputation: 764

Read a file in PySpark with a custom column and record delimiter

Is there any way to use a custom record delimiter when reading a CSV file in PySpark? In my file, records are separated by ** instead of a newline. Is there any way of using this custom line/record separator when reading the CSV into a PySpark DataFrame? My column separator is ';'. The code below picks up the columns correctly, but it reads the whole file as a single row.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('temp').getOrCreate()
df = (spark.read.format('csv')
      .option("header", "false").option("delimiter", ';')
      .option("inferSchema", "true").load("some-file-on-s3"))

Upvotes: 0

Views: 1342

Answers (1)

Alex Ortner

Reputation: 1228

I would read it as a plain text file into an RDD, split on the character sequence that is your record separator, and then convert it to a DataFrame, like this:

# sc is the SparkContext (e.g. sc = spark.sparkContext)
rdd1 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))   # one row per **-separated record
        .map(lambda x: x.split(";"))              # split each record into columns
        )
df1 = rdd1.toDF(["a", "b", "c"])
df1.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+

Or like this, converting to a single-column DataFrame first and splitting the columns with DataFrame functions:


from pyspark.sql import functions as f

rdd2 = (sc
        .textFile("/jupyter/nfs/test.txt")
        .flatMap(lambda line: line.split("**"))
        .map(lambda x: [x])                       # wrap each record so it becomes a one-column row
        )
df2 = (rdd2
       .toDF(["abc"])
       .withColumn("a", f.split(f.col("abc"), ";")[0])
       .withColumn("b", f.split(f.col("abc"), ";")[1])
       .withColumn("c", f.split(f.col("abc"), ";")[2])
       .drop("abc")
       )
df2.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
| a1| b1| c1|
| a2| b2| c2|
| a3| b2| c3|
+---+---+---+

where test.txt looks like this:

a1;b1;c1**a2;b2;c2**a3;b2;c3
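
If you would rather have Spark split records on ** at read time instead of in Python, you can also set Hadoop's textinputformat.record.delimiter and read the file through newAPIHadoopFile. This is a minimal sketch, reusing the path and column names from the example above; the class names are the standard Hadoop new-API TextInputFormat and Writable types:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('temp').getOrCreate()
sc = spark.sparkContext

# tell the Hadoop text input format to treat "**" as the record delimiter
conf = {"textinputformat.record.delimiter": "**"}
rdd = (sc.newAPIHadoopFile(
           "/jupyter/nfs/test.txt",
           "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
           "org.apache.hadoop.io.LongWritable",
           "org.apache.hadoop.io.Text",
           conf=conf)
       .map(lambda kv: kv[1])            # drop the byte offset, keep the record text
       .map(lambda rec: rec.split(";")))
df = rdd.toDF(["a", "b", "c"])
df.show()

The advantage over the plain textFile approach is that the ** splitting happens at the input-format level, so a file that is one huge physical line does not have to be materialised as a single string before splitting.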

Upvotes: 1
