gpanda

Reputation: 875

Pyspark Pair RDD from Text File

I have a local text file kv_pair.log formatted such that key-value pairs are comma-delimited and each record ends with a newline:

"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"

I am trying to read this to a Pair RDD using pySpark as follows:

from pyspark import SparkContext
sc = SparkContext()

# Read raw text to RDD
lines=sc.textFile('kv_pair.log')

# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))

print type(pairs)
print pairs.take(2)

I feel I am close! The output of the above is:

[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]

So it looks like pairs is a list of records, which contains a list of the kv pairs as strings.

How can I use pySpark to transform this into a Pair RDD such that the keys and values are properly separated?

The ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time; please help with transforming this into a Pair RDD first.

Upvotes: 0

Views: 9977

Answers (2)

mrsrinivas

Reputation: 35434

You can use flatMap with a custom function, since a lambda can't contain multiple statements:

def transform(x):
    lst = x.replace('"', '').split(",")
    return [(kv.split("=")[0], kv.split("=")[1]) for kv in lst]

pairs = lines.flatMap(transform)

Upvotes: 1
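For what it's worth, the parsing logic can be sanity-checked without a Spark cluster: flatMap is equivalent to mapping the function over each line and flattening the results into one sequence. A minimal sketch using plain Python lists in place of an RDD:

```python
# transform() mirrors the function passed to flatMap in the answer above.
def transform(x):
    lst = x.replace('"', '').split(",")
    return [(kv.split("=")[0], kv.split("=")[1]) for kv in lst]

lines = [
    '"A"="foo","B"="bar","C"="baz"',
    '"A"="oof","B"="rab","C"="zab"',
]

# Emulate lines.flatMap(transform): map, then flatten one level.
pairs = [pair for line in lines for pair in transform(line)]
print(pairs[:3])  # [('A', 'foo'), ('B', 'bar'), ('C', 'baz')]
```

With map instead of flatMap you would get an RDD of lists of tuples; flatMap is what produces a flat RDD of (key, value) tuples, i.e. a proper Pair RDD.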

grepe

Reputation: 1977

This is really bad practice for a parser, but I believe your example could be done with something like this:

from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext()

# Read raw text to RDD
lines=sc.textFile('kv_pair.log')

# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: x.replace('"', '').split(","))\
           .map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1]))

print type(pairs)
print pairs.take(2)

Upvotes: 0
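The nested splits in this answer can also be verified on a single record without a SparkContext; each `r[i].split('=')[1]` picks out the value part of the i-th field. A quick sketch (a plain dict stands in for `pyspark.sql.Row` here, purely for illustration):

```python
# Emulate the two map steps on one record, without Spark.
line = '"A"="foo","B"="bar","C"="baz"'
r = line.replace('"', '').split(",")   # ['A=foo', 'B=bar', 'C=baz']
row = {"A": r[0].split("=")[1],        # stand-in for Row(A=..., B=..., C=...)
       "B": r[1].split("=")[1],
       "C": r[2].split("=")[1]}
print(row)  # {'A': 'foo', 'B': 'bar', 'C': 'baz'}
```

Note that this produces an RDD of Row objects rather than a Pair RDD of (key, value) tuples, but it is convenient as a stepping stone toward the DataFrame the question ultimately wants.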
