ishwar

Reputation: 298

How to get the same output in Python as in Scala

LINK TO data.csv In Scala, the code gives an array of strings, and I want exactly the same output in Python. Code in Scala:

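import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test_Parquet")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Read the CSV and rewrite it as a single Parquet file
val parquetDF = spark.read.csv("data.csv")
parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")

// Load the Parquet data back as an RDD of Rows
val rdd = spark.read.parquet("Parquet").rdd
val header = rdd.first()

// Drop the header row and print each remaining Row as a string
val rdd1 = rdd.filter(_ != header).map(x => x.toString)
rdd1.foreach(println)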

OUTPUT:

[Canada,47;97;33;94;6]
[Canada,59;98;24;83;3]
[Canada,77;63;93;86;62]
[China,86;71;72;23;27]
[China,74;69;72;93;7]
[China,58;99;90;93;41]
[England,40;13;85;75;90]
[England,39;13;33;29;14]
[England,99;88;57;69;49]
[Germany,67;93;90;57;3]
[Germany,0;9;15;20;19]
[Germany,77;64;46;95;48]
[India,90;49;91;14;70]
[India,70;83;38;27;16]
[India,86;21;19;59;4]

Code in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()

parquetDF = spark.read.csv("data.csv")

parquetDF.coalesce(1).write.mode("overwrite").parquet("Parquet")
rdd = spark.read.parquet("Parquet").rdd
header = rdd.first()
print(header)
# Drop the header row and convert each remaining Row to a string
rdd1 = rdd.filter(lambda line: header != line).map(lambda x: str(x))
rdd1.foreach(print)

The Python output is different from the Scala output, even though I'm doing the same thing in Python.

Upvotes: 0

Views: 31

Answers (1)

pissall

Reputation: 7399

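I think rdd1.foreach(print) should work, but since you're converting from a DataFrame, each element is a Row object, so str(x) gives you the Row(...) representation instead of the bracketed form that Scala prints.

I think the following should work. Note that list has to be applied to the Row objects themselves, before they are turned into strings; calling list on an already stringified row would split it into single characters:

rdd.filter(lambda line: header != line).map(list).foreach(print)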

Difference:

df.rdd.foreach(print)
# Row(Name='John', gender='Male', state='GA')
# Row(Name='Mary', gender='Female', state='GA')
# Row(Name='Alex', gender='Male', state='NY')
# Row(Name='Ana', gender='Female', state='NY')
# Row(Name='Amy', gender='Female', state='NY')

df.rdd.map(list).foreach(print)
# ['John', 'Male', 'GA']
# ['Mary', 'Female', 'GA']
# ['Alex', 'Male', 'NY']
# ['Ana', 'Female', 'NY']
# ['Amy', 'Female', 'NY']
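
If the goal is to match the Scala output character for character (square brackets, comma-separated values, no quotes), a small formatting helper may be closer than list. This is only a sketch: row_to_scala_string is a hypothetical name, not part of PySpark, and it assumes every column is a string or null, which is the case when the CSV is read without a schema. It relies on Row being a tuple subclass, so the same function works on plain tuples:

```python
def row_to_scala_string(row):
    """Format a Row (or any tuple/list) like Scala's Row.toString: [a,b,c]."""
    # Scala prints null for missing values, so map None to "null"
    return "[" + ",".join("null" if v is None else str(v) for v in row) + "]"


# Plain tuple standing in for a Row, so the sketch runs without Spark
sample = ("Canada", "47;97;33;94;6")
print(row_to_scala_string(sample))  # [Canada,47;97;33;94;6]
```

With the rdd and header from the question, this would be used as rdd.filter(lambda line: header != line).map(row_to_scala_string).foreach(print).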

Note: If this is not your exact problem, please provide the actual and expected output.

Upvotes: 2
