Mitty

Reputation: 340

Combining lists inside values in PySpark

Performing collect() on an RDD gave me a list. I iterated over it to print the structure using this code:

for entry in ratings_and_users.collect():
    print(entry)

The output is,

(b'"20599"', ([7.0, b'"349802972X"'], ['bamberg, franken, germany', 'NULL']))
(b'"120675"', ([0.0, b'"0972189408"'], ['crescent city, california, usa', 45]))
(b'"166487"', ([6.0, b'"8422626993"'], ['santander, n/a, spain', 103]))
(b'"166487"', ([7.0, b'"8440639228"'], ['santander, n/a, spain', 103]))

In PySpark, I need to write a lambda to join all the lists in the value into a single list. In the output above, every line is a key-value pair; for example, the key b'"166487"' has the value ([7.0, b'"8440639228"'], ['santander, n/a, spain', 103]). The value contains multiple lists. How can I join them into a single list before performing collect on the RDD?

Required output structure:

(b'"166487"', ([7.0, b'"8440639228"', 'santander, n/a, spain', 103]))

Upvotes: 0

Views: 846

Answers (1)

Mitty

Reputation: 340

The problem was that I treated each item in the result of the collect operation as a key-value pair, but it is actually a tuple, with the key as the first entry and the value as the second. So I mapped the following function over the RDD and got the result:

def append_values_inside(key, value):
    # The value is a tuple of lists; flatten it into a single list.
    temp = []
    for v in value:
        for entry in v:
            temp.append(entry)
    return (key, temp)

# Rebuild each pair with its flattened value, then collect and print.
for entry in ratings_and_users.map(lambda a: append_values_inside(a[0], a[1])).collect():
    print(entry)

Final result:

(b'"20599"', [7.0, b'"349802972X"', 'bamberg, franken, germany', 'NULL'])
(b'"120675"', [0.0, b'"0972189408"', 'crescent city, california, usa', 45])
(b'"166487"', [6.0, b'"8422626993"', 'santander, n/a, spain', 103])
(b'"166487"', [7.0, b'"8440639228"', 'santander, n/a, spain', 103])

Upvotes: 0
