Reputation: 3456
I have a Scala Spark notebook on an AWS EMR cluster that loads data from an AWS S3 bucket. Previously, I had standard code like the following:
var stack = spark.read.option("header", "true").csv("""s3://someDirHere/*""")
This loaded multiple directories of files (.txt.gz) into a Spark DataFrame object called stack.
Recently, new files were added to this directory. The content of the new files looks the same (I downloaded a couple of them and opened them in both Sublime Text and Notepad++; I tried two different text editors to see if there were perhaps some invisible, non-Unicode characters that were disrupting the interpretation of the first line as a header). The new data files cause my code above to ignore the first header line and instead interpret the second line as the header. I have tried a few variations without luck; here are a few examples of things I tried:
var stack = spark.read.option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
I wish I could share the files, but they contain confidential information. Just wondering if there are some ideas as to what I can try.
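One thing worth checking on a couple of the downloaded copies is whether the new files start with a UTF-8 byte order mark or some other control bytes that a text editor silently hides. A minimal sketch in plain Scala (the file name "local-copy.txt.gz" is a hypothetical local copy of one of the S3 files):

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

// Return the first n decompressed bytes of a .txt.gz file, so a UTF-8 BOM
// (EF BB BF) or a stray control character ahead of the header becomes visible.
def headBytes(path: String, n: Int = 16): List[Int] = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  try Iterator.continually(in.read()).takeWhile(_ != -1).take(n).toList
  finally in.close()
}

// Example: print the leading bytes as hex
// headBytes("local-copy.txt.gz").map(b => f"$b%02X").mkString(" ")
```

If the old files start directly with the header text but the new ones show EF BB BF (or anything else) first, that difference would explain why Spark no longer recognizes line one as the header.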
Upvotes: 0
Views: 201
Reputation: 650
How many files are there? If it's too many to check manually, you could try to read them without the header option. Your expectation is that the header matches everywhere, right?
If that's truly the case, this should have a count of 1:
spark.read.csv("path").limit(1).dropDuplicates().count()
If not, you can see like this what different headers there are:
spark.read.csv("path").limit(1).dropDuplicates().show()
Remember, it's important not to use the header option, so you can operate on the header rows as data.
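The same comparison can also be done on a handful of downloaded copies without Spark: read just the first line of each .txt.gz and collect the distinct values. A plain-Scala sketch (the directory name "someLocalDir" is hypothetical, standing in for files copied down from S3):

```scala
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

// Read only the first decompressed line of a gzipped text file.
def firstLine(path: String): String = {
  val r = new BufferedReader(
    new InputStreamReader(new GZIPInputStream(new FileInputStream(path)), "UTF-8"))
  try r.readLine() finally r.close()
}

// Collect the distinct first lines across every .txt.gz in a directory;
// a result set larger than one means the headers do not all match.
def distinctHeaders(dir: String): Set[String] =
  new File(dir).listFiles
    .filter(_.getName.endsWith(".txt.gz"))
    .map(f => firstLine(f.getPath))
    .toSet

// Example: distinctHeaders("someLocalDir")
```

If this set has more than one element, the new files' first lines differ from the old ones in some way, even if the difference is invisible in a text editor.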
Upvotes: 1