PySpark lambda operation to create key pairs

Question

I already have code that maps my data to this:

['vita', 'oscura', 'smarrita', 'dura', 'forte', 'paura', 'morte', 'trovai', 'scorte', 'v’intrai']

I want this:

[('vita', 'oscura', 1), ('oscura', 'smarrita', 1), ('smarrita', 'dura', 1), ('dura', 'forte', 1), ...]

I thought I could do this with a lambda function: for every line, I ask for the first row, first item, and then for the first row, second column. That fails with an index-out-of-range error. Any pointers on how I could go about this?

This is my code so far:

def lower_clean_str(x):
  # lowercase the line, then strip every ASCII punctuation character
  punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch, '')
  return lowercased_str

clean_dcr=dcr.map(lower_clean_str)
print(clean_dcr.take(10))

#we split on whitespace as in ex1; notice how this time we take [-1] to grab only the last word
clean_dcr=clean_dcr.map(lambda line: line.split()[-1])
print(clean_dcr.take(10))

#this gives an error
#clean_dcr=clean_dcr.map((lambda line:line[0][0],line[0][1])),1)
#print(clean_dcr.take(3))

Answers (1)
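
A `map` lambda only ever sees one RDD element at a time. After `line.split()[-1]` each element is a single word (a plain string), so `line[0]` is one character and `line[0][1]` raises an IndexError; the neighbouring word is simply not reachable from inside that lambda. Below is a minimal sketch of two possible ways around this, not a definitive solution. The helper names `consecutive_pairs` and `pair_consecutive_rdd` are made up for illustration and are not from the post. The first assumes the words fit in driver memory (for example after `clean_dcr.collect()`); the second assumes you want to stay on the RDD and relies only on `zipWithIndex`, `map`, `join` and `sortByKey`.

```python
def consecutive_pairs(words):
    """Pair each word with its successor in a plain Python list."""
    # zip stops at the shorter input, so the last word gets no partner
    return [(a, b, 1) for a, b in zip(words, words[1:])]


def pair_consecutive_rdd(rdd):
    """Same idea on an RDD of words (hypothetical helper, needs a Spark RDD)."""
    # (index, word): zipWithIndex yields (word, index), so swap the two
    indexed = rdd.zipWithIndex().map(lambda wi: (wi[1], wi[0]))
    # key every word by the index of its predecessor: (index - 1, word)
    shifted = indexed.map(lambda iw: (iw[0] - 1, iw[1]))
    # the inner join keeps only indices that have a successor
    return (indexed.join(shifted)
                   .sortByKey()
                   .map(lambda kv: (kv[1][0], kv[1][1], 1)))


# Pure-Python demo on the first few words from the question
words = ['vita', 'oscura', 'smarrita', 'dura', 'forte']
print(consecutive_pairs(words)[:2])
# [('vita', 'oscura', 1), ('oscura', 'smarrita', 1)]
```

With the RDD from the question, the second helper would be called as `pair_consecutive_rdd(clean_dcr).take(3)`. The join triggers a shuffle, so for a text of this size collecting the words and using `consecutive_pairs` is probably the simpler option.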
