nikos

Reputation: 3003

PySpark - Split records of an RDD by one or more tabs

I have created an RDD from an input file, which looks like this:

[u'$, Claw\t\t\t"OnCreativity" (2012)  [Himself]']
[u'$, Homo\t\t\tNykytaiteen museo (1986)  [Himself]  <25>\n\t\t\tSuuri illusioni (1985)  [Guests]  <22>']
[u'$, Steve\t\tE.R. Sluts (2003) (V)  <12>']

It is easy to split each record in this RDD on a single tab character, '\t', but what I would like is to have each record split on one or more consecutive tabs.

I have tried the usual Python ways of splitting a string on one or more tabs, but these solutions do not seem to work in PySpark when applied to the records of an RDD.
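To illustrate why a single-tab split is not enough here (a plain-Python sketch, no Spark needed to see the issue): splitting on one '\t' leaves empty strings wherever consecutive tabs appear.

```python
# Splitting on a single tab keeps an empty string for every
# extra tab between two fields:
parts = "foo\t\t\tbar".split("\t")
print(parts)  # ['foo', '', '', 'bar']
```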

Upvotes: 1

Views: 2186

Answers (1)

zero323

Reputation: 330153

I am not exactly sure what you mean by a set of RDDs but it looks like what you need here is a simple regular expression:

import re
pattern = re.compile(r"\t+")

rdd = sc.parallelize([
    u"foo\t\t\t\tbar",
    u"123\t\t\t456\t\t789\t0"
])

rdd.map(lambda x: pattern.split(x)).collect()

## [[u'foo', u'bar'], [u'123', u'456', u'789', u'0']]
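The same pattern works on records shaped like those in your question; here is a sketch in plain Python (the function passed to `rdd.map` runs ordinary Python code on each record, so the behavior inside Spark is the same):

```python
import re

# One-or-more tabs, same pattern as above
pattern = re.compile(r"\t+")

# A record shaped like the ones in the question
record = u'$, Claw\t\t\t"OnCreativity" (2012)  [Himself]'

fields = pattern.split(record)
print(fields)  # ['$, Claw', '"OnCreativity" (2012)  [Himself]']
```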

Upvotes: 2
