How to get the same output in Python as in Scala

Question

Link to data.csv. In Scala, the code below gives an array of strings. I want exactly the same output in Python. Code in Scala:

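import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test_Parquet")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Read the CSV and rewrite it as a single Parquet file
val parquetDF = spark.read.csv("data.csv")
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")

// Read the Parquet data back as an RDD[Row]; treat the first row as the header
val rdd = spark.read.parquet("Parquet").rdd
val header = rdd.first()

// Drop the header row and render each Row as a string
val rdd1 = rdd.filter(_ != header).map(x => x.toString)
rdd1.foreach(println)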

OUTPUT:

[Canada,47;97;33;94;6]
[Canada,59;98;24;83;3]
[Canada,77;63;93;86;62]
[China,86;71;72;23;27]
[China,74;69;72;93;7]
[China,58;99;90;93;41]
[England,40;13;85;75;90]
[England,39;13;33;29;14]
[England,99;88;57;69;49]
[Germany,67;93;90;57;3]
[Germany,0;9;15;20;19]
[Germany,77;64;46;95;48]
[India,90;49;91;14;70]
[India,70;83;38;27;16]
[India,86;21;19;59;4]

Code in Python:

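from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()

# Read the CSV and rewrite it as a single Parquet file
parquetDF = spark.read.csv("data.csv")
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")

# Read the Parquet data back as an RDD of Row objects; treat the first row as the header
rdd = spark.read.parquet("Parquet").rdd
header = rdd.first()
print(header)

# Drop the header row and convert each Row to a string
rdd1 = rdd.filter(lambda line: header != line).map(lambda x: str(x))
rdd1.foreach(print)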

The Python output is different from the Scala output, even though I'm doing the same thing in both.

pissall · Accepted Answer

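rdd1.foreach(print) does print every element, but because the RDD comes from a DataFrame, its elements are Row objects, so str(x) gives you Row(...) strings rather than the Scala-style output.

I think the following should work. Note that it converts each Row to a list instead of calling str on it, so it replaces map(lambda x: str(x)) rather than being chained after it:

rdd.filter(lambda line: header != line).map(list).foreach(print)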

Difference:

df.rdd.foreach(print)
# Row(Name='John', gender='Male', state='GA')
# Row(Name='Mary', gender='Female', state='GA')
# Row(Name='Alex', gender='Male', state='NY')
# Row(Name='Ana', gender='Female', state='NY')
# Row(Name='Amy', gender='Female', state='NY')

df.rdd.map(list).foreach(print)
# ['John', 'Male', 'GA']
# ['Mary', 'Female', 'GA']
# ['Alex', 'Male', 'NY']
# ['Ana', 'Female', 'NY']
# ['Amy', 'Female', 'NY']
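
If the goal is to match Scala's bracketed format exactly (for example [Canada,47;97;33;94;6]) rather than a Python list repr, one option is to join the fields yourself. This is a minimal sketch, assuming every row is a plain sequence of values (a PySpark Row is a tuple subclass, so it should qualify). The helper name scala_style is hypothetical and not part of PySpark, and rendering None as null is an assumption about how Scala prints missing values:

```python
def scala_style(row):
    """Format a sequence of values like Scala's Row.toString, e.g. [a,b]."""
    # Assumption: missing values (None) are rendered as "null".
    fields = ("null" if value is None else str(value) for value in row)
    return "[" + ",".join(fields) + "]"


# Plain tuples stand in for PySpark Row objects here.
print(scala_style(("Canada", "47;97;33;94;6")))  # [Canada,47;97;33;94;6]
```

With the RDD from the question, this would be used as rdd.filter(lambda line: header != line).map(scala_style).foreach(print).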

Note: If this is not your exact problem, please provide the actual and expected output.
