Reputation: 3728
I am trying to convert a Spark RDD to a DataFrame. The RDD is fine, but when I convert it to a DataFrame I get an index out of range error.
from pyspark.sql import Row

alarms = sc.textFile("hdfs://nanalyticsedge.com:8020/hdp/oneday.csv")
alarms = alarms.map(lambda line: line.split(","))
header = alarms.first()
alarms = alarms.filter(lambda line: line != header)
alarms = alarms.filter(lambda line: len(line) > 1)
alarms_df = alarms.map(lambda line: Row(IDENTIFIER=line[0], SERIAL=line[1], NODE=line[2],
                                        NODEALIAS=line[3], MANAGER=line[4], AGENT=line[5],
                                        ALERTGROUP=line[6], ALERTKEY=line[7], SEVERITY=line[8],
                                        SUMMARY=line[9])).toDF()
alarms_df.take(100)
Here alarms.count() works fine, whereas alarms_df.count() gives index out of range. The data is an export from Oracle.
From @Dikei's answer I found that:
alarms = alarms.filter(lambda line: len(line) == 10)
gives me a proper DataFrame, but why does data get lost when it is a database export, and how do I prevent it?
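As a quick diagnostic, a sketch against the alarms RDD above (run before the Row mapping) counts how many lines have each number of fields; any count for a length other than 10 points at the rows that break the Row construction:

length_counts = alarms.map(lambda line: len(line)).countByValue()
print(length_counts)  # e.g. {10: 58210, 11: 42, 7: 3}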
Upvotes: 1
Views: 1698
Reputation: 451
There is no data at the index you mention. Try something like the following: if an array has more than 9 elements, print its 10th element (index 9):
myData.foreach { x => if (x.size > 9) println(x(9)) }
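Since the question uses PySpark, a rough Python equivalent of this check (a sketch assuming the alarms RDD from the question) would be:

# Pull a small sample of malformed rows back to the driver for inspection
bad_rows = alarms.filter(lambda x: len(x) != 10).take(10)
for row in bad_rows:
    print(row)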
Upvotes: 0
Reputation: 11383
I think the problem is that some of your lines do not contain 10 elements. It's easy to check: try changing
alarms = alarms.filter(lambda line: len(line) > 1)
to
alarms = alarms.filter(lambda line: len(line) == 10)
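A likely reason some lines have the wrong number of fields is that a text column such as SUMMARY contains embedded commas, which a naive split(",") breaks apart. Assuming the export quotes such fields, a sketch that parses each line with Python's csv module instead would be:

import csv
from io import StringIO

def parse_line(line):
    # csv.reader respects quoted fields that contain commas
    return next(csv.reader(StringIO(line)))

alarms = sc.textFile("hdfs://nanalyticsedge.com:8020/hdp/oneday.csv")
alarms = alarms.map(parse_line)

If the export does not quote embedded commas, the extra fields cannot be recovered by splitting alone, and filtering on len(line) == 10 is the practical fix.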
Upvotes: 3