Dharm

Reputation: 23

Is there a better way to write this PySpark split code?

Learning big data and PySpark.

I have got an RDD customers which contains

[u'1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521']

To get a tuple of customer number and customer first + last name, I have the code below.

custname = customers.map(lambda x: (x.split(",")[8], x.split(",")[1] + " " +  x.split(",")[2]))

So my tuple would be ('78521', 'Richard Hernandez')

Is there a better way to write the above code? That is, instead of splitting 3 times, can there be a single split whose elements are then concatenated, or something similar?
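For reference, the single-split idea can be checked in plain Python without a Spark cluster; the record string below is the sample from the question:

```python
# Sample record from the question (a single CSV line in the RDD)
record = u'1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521'

# Current approach: split the string three times
current = (record.split(",")[8], record.split(",")[1] + " " + record.split(",")[2])

# Single-split approach: split once, then index into the field list
fields = record.split(",")
once = (fields[8], fields[1] + " " + fields[2])

assert current == once == ('78521', 'Richard Hernandez')
```

Both produce the same tuple; the difference is only how many times `split` runs per record.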

Upvotes: 0

Views: 61

Answers (2)

jxc

Reputation: 13998

Use flatMap() + list comprehension:

>>> customers.flatMap(lambda x: [ (e[8], e[1]+' '+e[2]) for e in [x.split(",")] ]).collect()
[(u'78521', u'Richard Hernandez')]
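The `for e in [x.split(",")]` idiom wraps the split result in a one-element list, so the comprehension loops exactly once with `e` bound to the full field list; the effect can be verified in plain Python on the question's sample record:

```python
record = u'1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521'

# The one-element list [record.split(",")] makes the comprehension
# iterate a single time, binding e to the list of fields.
result = [(e[8], e[1] + ' ' + e[2]) for e in [record.split(",")]]

print(result)  # [('78521', 'Richard Hernandez')]
```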

BTW, you can certainly write a function for your task:

def myfunc1(x):
    arr = x.split(',')
    return (arr[8], arr[1]+' '+arr[2])

customers.map(myfunc1).collect()
# [(u'78521', u'Richard Hernandez')]

Or:

def myfunc2(arr): return (arr[8], arr[1]+' '+arr[2])
customers.map(lambda x: myfunc2(x.split(','))).collect()

Or:

customers.map(lambda x: (lambda y: (y[8], y[1]+' '+y[2]))(x.split(','))).collect()

Upvotes: 1

Nishu Tayal

Reputation: 20850

You can first split the customers, then call another map to form the customer name, as follows:

customers_data = customers.map(lambda x: x.split(","))
custname = customers_data.map(lambda x: (x[8], x[1] + " " +  x[2]))
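Outside Spark, the same two-stage pipeline can be sketched with Python's built-in `map` (the sample record is taken from the question):

```python
records = [u'1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521']

# Stage 1: split each record once into its fields
customers_data = map(lambda x: x.split(","), records)
# Stage 2: build the (field-8, "first last") tuple from the split fields
custname = list(map(lambda x: (x[8], x[1] + " " + x[2]), customers_data))

print(custname)  # [('78521', 'Richard Hernandez')]
```

Each record is split only once, at the cost of an extra (lazily evaluated) map stage.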

Upvotes: 0
