jeannetton

Reputation: 37

Pyspark lambda operation to create key pairs

I already have code that maps my input to this list:

['vita', 'oscura', 'smarrita', 'dura', 'forte', 'paura', 'morte', 'trovai', 'scorte', 'v’intrai']

I want to turn it into consecutive pairs, each with a count of 1:

[('vita','oscura',1),('oscura','smarrita',1),('smarrita','dura',1), ('dura','forte',1) etc

I thought I could do this with a lambda function: for every line, I ask for the first row, first item, then for the first row, second column. That fails with an index out of range error. Any pointers on how I could go about this?

This is my code so far:

def lower_clean_str(x):
  punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch, '')
  return lowercased_str

clean_dcr=dcr.map(lower_clean_str)
print(clean_dcr.take(10))

# We split on whitespace as in ex1; notice how this time we take [-1] to grab only the last word of each line
clean_dcr=clean_dcr.map(lambda line: line.split()[-1])
print(clean_dcr.take(10))

#this gives an error
#clean_dcr=clean_dcr.map((lambda line:line[0][0],line[0][1])),1)
#print(clean_dcr.take(3))

Upvotes: 1

Views: 62

Answers (1)

Yaman Jain

Reputation: 1247

For Python 3.10 and above, you can use itertools.pairwise.

A sample snippet:

import itertools

input_list = ['vita', 'oscura', 'smarrita', 'dura', 'forte', 'paura', 'morte', 'trovai', 'scorte', 'v’intrai']

output = [element + (1, ) for element in itertools.pairwise(input_list)]
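As for the index error in the question, a likely reason is that after the `line.split()[-1]` step each element is a single word (a plain string), so `line[0]` is one character and `line[0][1]` has nothing to index. A small sketch in plain Python, assuming the elements look like the words in the list above:

```python
# Assumption: each element after the split()[-1] step is one word, e.g. 'vita'.
word = 'vita'

first_char = word[0]  # indexing a string yields a single character

try:
    first_char[1]  # a one-character string has no index 1
    error_raised = False
except IndexError:
    error_raised = True

print(first_char, error_raised)
```

A per-element lambda only ever sees one word at a time, which is why pairing neighbours needs something like pairwise over the whole sequence.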

For Python versions below 3.10, you can use the reference implementation of pairwise, which is also mentioned in the link.
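A minimal sketch of such a fallback, assuming the commonly used `itertools.tee`-based recipe; the short `words` list here is only illustrative:

```python
from itertools import tee


def pairwise(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)   # two independent iterators over the same items
    next(b, None)          # advance the second iterator by one position
    return zip(a, b)       # zip stops when the shorter iterator is exhausted


words = ['vita', 'oscura', 'smarrita', 'dura']

# Append the count 1 to every consecutive pair
pairs = [pair + (1,) for pair in pairwise(words)]
print(pairs)
```

With this definition the list comprehension stays the same as in the 3.10+ version; only the source of `pairwise` changes.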

Upvotes: 1
