Abhinav Singh

Reputation: 144

Adding elements of multiple lists corresponding to a key column of an RDD

I have PythonRDDs and need to add the elements of multiple lists index-wise for each key: element 1 of list 1 plus element 1 of list 2 plus element 1 of list 3, and so on. For CANADA, for example, the first element is 47 + 59 + 77, the second is 97 + 98 + 63, and so on.

I tried flattening the lists to add them, and I tried converting to a DataFrame, but neither worked. I want to do this for all 3 input forms shown below (Way 1, Way 2, Way 3).

countryCounts = [
('CANADA','47;97;33;94;6'),
('CANADA','59;98;24;83;3'),
('CANADA','77;63;93;86;62'),
('CHINA','86;71;72;23;27'),
('CHINA','74;69;72;93;7'),
('CHINA','58;99;90;93;41'),
('ENGLAND','40;13;85;75;90'),
('ENGLAND','39;13;33;29;14'),
('ENGLAND','99;88;57;69;49'),
('GERMANY','67;93;90;57;3'),
('GERMANY','9;15;20;19'),
('GERMANY','77;64;46;95;48'),
('INDIA','90;49;91;14;70'),
('INDIA','70;83;38;27;16'),
('INDIA','86;21;19;59;4')
]
countryCountsRdd = sc.parallelize(countryCounts)
# Replace the ';' separators with ',' in each value string
countryCountsSplit = countryCountsRdd.map(lambda x: (x[0], ",".join(x[1].split(';'))))
countryCountsSplit.collect()
# Collect all values for a country into one list
countryCountsGroup = countryCountsSplit.groupByKey().mapValues(list)
countryCountsGroup.collect()

Inputs:
Way 1:
[('CANADA', [47, 97, 33, 94, 6]), ('CANADA', [59, 98, 24, 83, 3]), ('CANADA', [77, 63, 93, 86, 62]), ('CHINA', [86, 71, 72, 23, 27]), ('CHINA', [74, 69, 72, 93, 7]), ('CHINA', [58, 99, 90, 93, 41]), ('ENGLAND', [40, 13, 85, 75, 90]), ('ENGLAND', [39, 13, 33, 29, 14]), ('ENGLAND', [99, 88, 57, 69, 49]), ('GERMANY', [67, 93, 90, 57, 3]), ('GERMANY', [9, 15, 20, 19]), ('GERMANY', [77, 64, 46, 95, 48]), ('INDIA', [90, 49, 91, 14, 70]), ('INDIA', [70, 83, 38, 27, 16]), ('INDIA', [86, 21, 19, 59, 4])] 
Way 2:
[('CANADA', [[47, 97, 33, 94, 6], [59, 98, 24, 83, 3], [77, 63, 93, 86, 62]]), ('CHINA', [[86, 71, 72, 23, 27], [74, 69, 72, 93, 7], [58, 99, 90, 93, 41]]), ('INDIA', [[90, 49, 91, 14, 70], [70, 83, 38, 27, 16], [86, 21, 19, 59, 4]]), ('ENGLAND', [[40, 13, 85, 75, 90], [39, 13, 33, 29, 14], [99, 88, 57, 69, 49]]), ('GERMANY', [[67, 93, 90, 57, 3], [9, 15, 20, 19], [77, 64, 46, 95, 48]])]
Way 3:
[('CANADA', '47 ,97 ,33 ,94 ,6'), ('CANADA', '59 ,98 ,24 ,83 ,3'), ('CANADA', '77 ,63 ,93 ,86 ,62'), ('CHINA', '86 ,71 ,72 ,23 ,27'), ('CHINA', '74 ,69 ,72 ,93 ,7'), ('CHINA', '58 ,99 ,90 ,93 ,41'), ('ENGLAND', '40 ,13 ,85 ,75 ,90'), ('ENGLAND', '39 ,13 ,33 ,29 ,14'), ('ENGLAND', '99 ,88 ,57 ,69 ,49'), ('GERMANY', '67 ,93 ,90 ,57 ,3'), ('GERMANY', '9 ,15 ,20 ,19'), ('GERMANY', '77 ,64 ,46 ,95 ,48'), ('INDIA', '90 ,49 ,91 ,14 ,70'), ('INDIA', '70 ,83 ,38 ,27 ,16'), ('INDIA', '86 ,21 ,19 ,59 ,4')]

Required output, the same for all 3:
[('CANADA','183;258;150;263;71')]
[('CHINA','218,239,234,209,75')]
[('ENGLAND','178,114,175,173,153')]
[('GERMANY','144,166,151,172,70')]
[('INDIA','246,153,148,100,90')]

Upvotes: 0

Views: 250

Answers (2)

pault

Reputation: 43524

You want to combine the values for a given key by taking the sum. This is precisely what reduceByKey does. You just need to define an associative and commutative reduce function to combine the values as desired.

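def myReducer(a, b):
    # Parse both ";"-separated strings into lists of ints
    # (real lists, so len() also works on Python 3, where map() is lazy).
    a = [int(x) for x in a.split(";")]
    b = [int(x) for x in b.split(";")]
    maxLength = max(len(a), len(b))
    # Zero-pad the shorter list so both have maxLength elements.
    a = a + [0] * (maxLength - len(a))
    b = b + [0] * (maxLength - len(b))
    return ";".join(str(x + y) for x, y in zip(a, b))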

The only real tricky part here is that your sample input lists are not all the same size. In this case, I defined the function to zero pad the shorter list.

Now call reduceByKey:

countryCountsRdd.reduceByKey(myReducer).collect()
#[('CANADA', '183;258;150;263;71'),
# ('CHINA', '218;239;234;209;75'),
# ('INDIA', '246;153;148;100;90'),
# ('ENGLAND', '178;114;175;173;153'),
# ('GERMANY', '153;172;156;171;51')]
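
The same zero-padding idea should carry over to the grouped input (Way 2 in the question), where each key already holds a list of lists. The following is only a sketch, assuming the values are lists of ints; `sum_columns` and `groupedRdd` are hypothetical names used for illustration, not part of Spark or of the question's code.

```python
from itertools import zip_longest

def sum_columns(rows):
    """Sum several lists index-wise, treating missing positions as 0."""
    return [sum(col) for col in zip_longest(*rows, fillvalue=0)]

# Local check with the GERMANY rows from the question (one row is short).
germany = [[67, 93, 90, 57, 3], [9, 15, 20, 19], [77, 64, 46, 95, 48]]
print(sum_columns(germany))  # [153, 172, 156, 171, 51]

# With Spark, assuming groupedRdd holds (country, list of int lists) pairs:
# groupedRdd.mapValues(sum_columns).collect()
```

Because the grouping has already been done, `mapValues` is enough here and no further shuffle is needed; for large data, reducing directly with `reduceByKey` is usually preferable to grouping first.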

Upvotes: 1

Anurag Pandey

Reputation: 11

You can do this with a simple reduceByKey operation on the RDD.

Input RDD: RDD[(String, List)], i.e. (country, list of counts) pairs

Output RDD: input.reduceByKey((x, y) -> addFunction(x, y))

addFunction(x, y) iterates over the two input lists, adds the elements index-wise, and returns the summed list.
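
One possible Python version of such a function is sketched below, assuming the values are lists of ints (Way 1 in the question). `add_function`, `parse_counts`, `way1Rdd`, and `way3Rdd` are hypothetical names chosen for illustration; the shorter list is padded with 0, which is an assumption about how uneven rows should be handled. The function is checked locally with `functools.reduce`, which combines values pairwise much like `reduceByKey` does within a key.

```python
from functools import reduce
from itertools import zip_longest

def add_function(x, y):
    """Add two lists index-wise; the shorter one is padded with 0."""
    return [a + b for a, b in zip_longest(x, y, fillvalue=0)]

def parse_counts(text, sep=","):
    """Turn a string such as '47 ,97 ,33' into [47, 97, 33]."""
    return [int(part) for part in text.split(sep)]

# Local check with the CANADA rows from the question.
canada = [[47, 97, 33, 94, 6], [59, 98, 24, 83, 3], [77, 63, 93, 86, 62]]
print(reduce(add_function, canada))  # [183, 258, 150, 263, 71]

# With Spark, assuming way1Rdd holds (country, list of ints) pairs:
# way1Rdd.reduceByKey(add_function).collect()
# For comma-separated strings (Way 3), parse first:
# way3Rdd.mapValues(parse_counts).reduceByKey(add_function).collect()
```

The result here is a list of ints per country; joining it back into a string with `";".join(map(str, values))` would give the format shown in the question.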

Upvotes: 1
