Reputation: 1054
I have this RDD:
[[u''], [u'E01', u'Lokesh'], [u'E10', u'Venkat'], [u'EO2', u'Bhupesh'], [u'EO3', u'Amit'], [u'EO4', u'Ratan'], [u'EO5', u'Dinesh'], [u'EO6', u'Pavan'], [u'EO7', u'Tejas'], [u'EO8', u'Sheela']]
I want to sort it by the second column (name), so that the result looks like this:
[u'EO3', u'Amit'],
[u'EO2', u'Bhupesh'],
[u'EO5', u'Dinesh'],
[u'E01', u'Lokesh'],
[u'EO6', u'Pavan'],
[u'EO8', u'Sheela'],
[u'EO7', u'Tejas'],
[u'E10', u'Venkat']
I tried this:
sorted = employee_rows.sortBy(lambda line: line[1])
But it gives me this:
IndexError: list index out of range
How can I sort by the second column?
Thanks!
Upvotes: 1
Views: 481
Reputation: 43504
In general, you should make all of your higher-order RDD functions robust to bad inputs. In this case, the error occurs because at least one record (here the first one, [u'']) has no second column, so line[1] raises the IndexError.
One way is to check the length of line inside the lambda:
employee_rows.sortBy(lambda line: line[1] if len(line) > 1 else None).collect()
#[[u''],
# [u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat']]
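One caveat, assuming the job may run on Python 3: the None fallback relies on Python 2 ordering, where None compares lower than any string. On Python 3, comparing None with a string raises a TypeError. Below is a minimal sketch of a variant that falls back to an empty string instead. The name name_or_empty is made up for illustration, and a plain list stands in for the RDD so the snippet runs without Spark:

```python
# Plain-Python stand-in for a few of the RDD rows from the question
# (assumption: no Spark session is needed to see how the key behaves).
rows = [[u''], [u'E01', u'Lokesh'], [u'E10', u'Venkat'],
        [u'EO2', u'Bhupesh'], [u'EO3', u'Amit']]

def name_or_empty(line):
    # Hypothetical helper: use an empty string for rows without a name.
    # An empty string compares with other strings on Python 2 and 3,
    # and it sorts before every non-empty name.
    return line[1] if len(line) > 1 else u''

result = sorted(rows, key=name_or_empty)

# With an RDD, the same function could be passed as the key:
# employee_rows.sortBy(name_or_empty).collect()
```

As with the None version, the short row still ends up first.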
Or you could define a custom sort function with try/except. Here's a way to make the "bad" rows sort last:
def mysort(line):
    try:
        return line[1]
    except IndexError:
        # the row has no second column; since you're sorting
        # alphabetically, 'Z' pushes it towards the end
        return 'Z'

employee_rows.sortBy(mysort).collect()
#[[u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat'],
# [u'']]
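Note that the 'Z' sentinel only sorts last as long as no real name compares greater than 'Z'. A name such as 'Zoe', or any lowercase name, would land after the bad rows. A sketch of an alternative that does not depend on the data is to sort on a tuple whose first element flags the bad rows. The name name_bad_last and the 'Zoe' row are made up for illustration, and a plain list again stands in for the RDD:

```python
# Sample rows; the 'Zoe' row is invented to show the edge case.
rows = [[u''], [u'E01', u'Zoe'], [u'EO3', u'Amit'], [u'EO2', u'Bhupesh']]

def name_bad_last(line):
    # Hypothetical helper: returns (is_bad, name). False sorts before
    # True, so well-formed rows come first, ordered by name, and rows
    # without a second column are grouped at the end.
    has_name = len(line) > 1
    return (not has_name, line[1] if has_name else u'')

result = sorted(rows, key=name_bad_last)

# With an RDD, the same function could be passed as the key:
# employee_rows.sortBy(name_bad_last).collect()
```

Because every key is a (bool, string) tuple, the comparison also works on Python 3.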
Upvotes: 3