Reputation: 875
I have a local text file kv_pair.log
formatted such as that key value pairs are comma delimited and records begin and terminate with a new line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this to a Pair RDD using pySpark as follows:
from pyspark import SparkContext
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs
is a list of records, which contains a list of the kv pairs as strings.
How can I use pySpark to transform this into a Pair RDD such as that the keys and values are properly separated?
Ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time, please help transforming this into a Pair RDD.
Upvotes: 0
Views: 9977
Reputation: 35434
You can use flatMap
with a custom function as lambda
can't be used for multiple statements
def tranfrm(x):
lst = x.replace('"', '').split(",")
return [(x.split("=")[0], x.split("=")[1]) for x in lst]
pairs = lines.map(tranfrm)
Upvotes: 1
Reputation: 1977
This is really bad practice for a parser, but I believe your example could be done with something like this:
from pyspark import SparkContext
from pyspark.sql import Row
sc=sparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))\
.map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1] )
print type(pairs)
print pairs.take(2)
Upvotes: 0