Sayantan

Reputation: 101

Data loss while reading a huge file in Spark Scala

import org.apache.spark.sql.functions.monotonically_increasing_id

// Read the file as a single string column ("val") and tag each row with an id
val data = spark.read
    .text(filepath)
    .toDF("val")
    .withColumn("id", monotonically_increasing_id())
val count = data.count()

This code works fine when I am reading a file containing up to 50k+ rows, but when a file has more rows than that, the code starts losing data. When it reads a file with 1 million+ rows, the final DataFrame count only shows about 65k rows. I can't understand where the problem is in this code or what needs to change so that every row ends up in the final DataFrame.

P.S. - The largest file this code will have to ingest has almost 14 million+ rows; currently it ingests only about 2 million of them.

Upvotes: 0

Views: 296

Answers (1)

toxicafunk

Reputation: 406

Seems related to How do I add an persistent column of row ids to Spark DataFrame?

That is, avoid using monotonically_increasing_id and follow one of the suggestions from that thread, as sketched below.
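For example, one approach suggested there is to build the id from RDD.zipWithIndex, which yields contiguous 0-based indices. A minimal sketch, assuming the same spark session and filepath as in the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

// Read the file, then attach a contiguous row index via zipWithIndex
// instead of monotonically_increasing_id (which is unique but not contiguous)
val raw = spark.read.text(filepath).toDF("val")

val withId = spark.createDataFrame(
  raw.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  raw.schema.add(StructField("id", LongType, nullable = false))
)

val count = withId.count() // should match the line count of the input file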

Upvotes: 1
