Reputation: 101
import org.apache.spark.sql.functions.monotonically_increasing_id

// read each line of the text file as a row, then tag every row with an id
val data = spark.read
  .text(filepath)
  .toDF("val")
  .withColumn("id", monotonically_increasing_id())
val count = data.count()
This code works fine when reading a file with up to ~50k rows, but once a file has more rows than that it starts losing data: when it reads a file with 1 million+ rows, the final DataFrame count only comes to about 65k rows. I can't understand where the problem is in this code, or what needs to change so that every row ends up in the final DataFrame.

P.S. The largest file this code will have to ingest has almost 14 million rows; currently it ingests only about 2 million of them.
Upvotes: 0
Views: 296
Reputation: 406
Seems related to How do I add a persistent column of row ids to Spark DataFrame? — i.e. avoid using monotonically_increasing_id and follow some of the suggestions from that thread.
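For example, one alternative often suggested in that thread is to build the id from zipWithIndex on the underlying RDD, which gives contiguous ids regardless of partitioning. A minimal sketch, assuming the same spark session and filepath as in the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val raw = spark.read.text(filepath).toDF("val")

// append a contiguous index to each row via zipWithIndex, then rebuild the DataFrame
val withId = spark.createDataFrame(
  raw.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(raw.schema.fields :+ StructField("id", LongType, nullable = false))
)

val count = withId.count()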
Upvotes: 1