Matthias

Reputation: 5764

PySpark: how to map by first item in array

My initial RDD is a list of blocks, where each block is itself a list of lines. So it's

[infos_var1, infos_var2]

and each block is

var_name, var_value1, var_value2, var_value3

The original data looks like this:

[[u'::852-YF-007\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0'],
 [u'::852-YF-008\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0']]

My question is: how can I use a map function to extract the variable name (852-YF-007 and 852-YF-008) as the key, with the timestamped lines as the value (here: 3 lines for each variable)?

Maybe someone can give me a hint on how to use map on my RDD. I was thinking of something like this:

df.map(lambda (k, v): (v[0], v[0-vEND]))

PS: The original post on how I created my initial RDD can be found here.

Upvotes: 2

Views: 1648

Answers (1)

ccheneson

Reputation: 49410

What you have is a list of lists of items, not tuples, so the (k, v) unpacking in your lambda won't work.

Try this:

df.map(lambda i: (i[0], i[1:]))

For the i[1:] part, look up slicing here.
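
For completeness, here is a minimal, self-contained sketch of how that map plays out on the sample data from the question, assuming a local SparkContext for testing. The extra strip/lstrip cleanup of the key (to turn u'::852-YF-007\t' into u'852-YF-007') is my assumption, not part of the answer above:

from pyspark import SparkContext

sc = SparkContext("local", "map-by-first-item")  # assumption: local context for testing

blocks = [[u'::852-YF-007\t',
           u'2016-05-10 00:00:00\t0',
           u'2016-05-09 23:59:00\t0',
           u'2016-05-09 23:42:00\t0'],
          [u'::852-YF-008\t',
           u'2016-05-10 00:00:00\t0',
           u'2016-05-09 23:59:00\t0',
           u'2016-05-09 23:42:00\t0']]

rdd = sc.parallelize(blocks)

# First element of each block becomes the key, the remaining lines the value.
pairs = rdd.map(lambda i: (i[0], i[1:]))

# Optional cleanup (an assumption, not part of the answer): strip the
# trailing tab and the leading '::' to get a bare variable name as the key.
clean = pairs.map(lambda kv: (kv[0].strip().lstrip(':'), kv[1]))

print(clean.collect())
# [(u'852-YF-007', [u'2016-05-10 00:00:00\t0', ...]),
#  (u'852-YF-008', [u'2016-05-10 00:00:00\t0', ...])]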

Upvotes: 1
