Reputation: 101
import org.apache.spark.sql.functions.monotonically_increasing_id

// read each line of the text file as a row, then tag every row with an id
val data = spark.read
  .text(filepath)
  .toDF("val")
  .withColumn("id", monotonically_increasing_id())
val count = data.count()
This code works fine when reading a file with up to ~50k rows, but once a file has more rows than that it starts losing data: when it reads a file with 1 million+ rows, the final DataFrame count only comes to about 65k rows. I can't understand where the problem is in this code, or what needs to change so that every row ends up in the final DataFrame.

P.S. The largest file this code will have to ingest has almost 14 million rows; currently it ingests only about 2 million of them.
Upvotes: 0
Views: 296
Reputation: 406
Seems related to How do I add a persistent column of row ids to Spark DataFrame? — i.e. avoid using monotonically_increasing_id and follow some of the suggestions from that thread.
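For example, one alternative often suggested in that thread is to build the id from zipWithIndex on the underlying RDD, which gives contiguous ids regardless of partitioning. A minimal sketch, assuming the same spark session and filepath as in the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val raw = spark.read.text(filepath).toDF("val")

// append a contiguous index to each row via zipWithIndex, then rebuild the DataFrame
val withId = spark.createDataFrame(
  raw.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(raw.schema.fields :+ StructField("id", LongType, nullable = false))
)

val count = withId.count()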
Upvotes: 1