PySpark - Sort RDD by Second Column

Question

I've this RDD:

[[u''], [u'E01', u'Lokesh'], [u'E10', u'Venkat'], [u'EO2', u'Bhupesh'], [u'EO3', u'Amit'], [u'EO4', u'Ratan'], [u'EO5', u'Dinesh'], [u'EO6', u'Pavan'], [u'EO7', u'Tejas'], [u'EO8', u'Sheela']]

And I want to sort by the second column (name). I try this but without success:

[u'EO3', u'Amit'], 
[u'EO2', u'Bhupesh'], 
[u'EO5', u'Dinesh'], 
[u'E01', u'Lokesh'], 
[u'EO6', u'Pavan'],
[u'EO8', u'Sheela'],
[u'EO7', u'Tejas'],
[u'E10', u'Venkat']

I try with this:

sorted = employee_rows.sortBy(lambda line: line[1])

But it gives me this:

IndexError: list index out of range

How can sortby the second column?

Thanks!

pault · Accepted Answer

In general, you should make all of your higher order rdd functions robust to bad inputs. In this case, your error is because you have at least one record that does not have a second column.

One way is to put a condition check on the length of line inside the lambda:

employee_rows.sortBy(lambda line: line[1] if len(line) > 1 else None).collect()
#[[u''],
# [u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat']]

Or you could define a custom sort function with try/except. Here's a way to make the "bad" rows sort last:

def mysort(line):
    try:
        return line[1]
    except:
        # since you're sorting alphabetically
        return 'Z'

employee_rows.sortBy(mysort).collect()
#[[u'EO3', u'Amit'],
# [u'EO2', u'Bhupesh'],
# [u'EO5', u'Dinesh'],
# [u'E01', u'Lokesh'],
# [u'EO6', u'Pavan'],
# [u'EO4', u'Ratan'],
# [u'EO8', u'Sheela'],
# [u'EO7', u'Tejas'],
# [u'E10', u'Venkat'],
# [u'']]

PySpark - Sort RDD by Second Column

Answers (1)

Related Questions