David Schuler

Reputation: 1031

Scala - Remove first row of Spark DataFrame

I know DataFrames are supposed to be immutable, and I know it's not a great idea to try to change them. However, the file I'm receiving has a useless 4-column header row (the whole file has 50+ columns). So, what I'm trying to do is just get rid of the very top row, because it throws everything off.

I've tried a number of different solutions (mostly found on here) like using .filter() and map replacements, but haven't gotten anything to work.

Here's an example of how the data looks:

H | 300 | 23098234 | N
D | 399 | 54598755 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 654 | 65465465 | Y | 09983 | 09823 | 02983 | ... | 0987098
D | 198 | 02982093 | Y | 09983 | 09823 | 02983 | ... | 0987098

Any ideas?

Upvotes: 0

Views: 14328

Answers (1)

blr

Reputation: 968

The cleanest way I've seen so far is to read the file as an RDD of lines and filter out the first row:

val csvRows          = sc.textFile("path_to_csv")
val skipableFirstRow = csvRows.first()
val usefulCsvRows    = csvRows.filter(row => row != skipableFirstRow)
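The same logic can be sanity-checked on a plain Scala collection, without a Spark cluster (the pipe-delimited lines below are a made-up sample mirroring the question's data). One caveat worth knowing: `filter` removes *every* line equal to the header, not just the first occurrence, which is usually fine for a distinct header row like this one.

```scala
object DropHeaderDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines shaped like the question's file
    val rows = Seq(
      "H|300|23098234|N",   // header row to discard
      "D|399|54598755|Y",
      "D|654|65465465|Y",
      "D|198|02982093|Y"
    )

    val header = rows.head
    // Same idea as the RDD version: keep everything that isn't the header
    val useful = rows.filter(row => row != header)

    println(useful.size)   // 3
    println(useful.head)   // D|399|54598755|Y
  }
}
```

If you only ever need this for CSV-like files, newer Spark versions (2.0+) can also skip the header at read time with `spark.read.option("header", "true")`, avoiding the manual filter entirely.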

Upvotes: 2
