nikos

Reputation: 3003

PySpark - Split records of an RDD by one or more tabs

I have created an RDD from an input file, which looks like this:

[u'$, Claw\t\t\t"OnCreativity" (2012)  [Himself]']
[u'$, Homo\t\t\tNykytaiteen museo (1986)  [Himself]  <25>\n\t\t\tSuuri illusioni (1985)  [Guests]  <22>']
[u'$, Steve\t\tE.R. Sluts (2003) (V)  <12>']

It is easy to split each record in this RDD on a single tab character, '\t', but what I would like is to have each record split on one or more consecutive tabs.

I have tried the usual Python ways of splitting a string on one or more tabs, but these solutions do not seem to work in PySpark when applied to the records of an RDD.
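To illustrate why a single-tab split is not enough here (a plain-Python sketch, no Spark needed to see the issue): splitting on one '\t' leaves empty strings wherever consecutive tabs appear.

```python
# Splitting on a single tab keeps an empty string for every
# extra tab between two fields:
parts = "foo\t\t\tbar".split("\t")
print(parts)  # ['foo', '', '', 'bar']
```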

Upvotes: 1

Views: 2186

Answers (1)

zero323

Reputation: 330153

I am not exactly sure what you mean by a set of RDDs but it looks like what you need here is a simple regular expression:

import re
pattern = re.compile(r"\t+")

rdd = sc.parallelize([
    u"foo\t\t\t\tbar",
    u"123\t\t\t456\t\t789\t0"
])

rdd.map(lambda x: pattern.split(x)).collect()

## [[u'foo', u'bar'], [u'123', u'456', u'789', u'0']]
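The same pattern works on records shaped like those in your question; here is a sketch in plain Python (the function passed to `rdd.map` runs ordinary Python code on each record, so the behavior inside Spark is the same):

```python
import re

# One-or-more tabs, same pattern as above
pattern = re.compile(r"\t+")

# A record shaped like the ones in the question
record = u'$, Claw\t\t\t"OnCreativity" (2012)  [Himself]'

fields = pattern.split(record)
print(fields)  # ['$, Claw', '"OnCreativity" (2012)  [Himself]']
```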

Upvotes: 2
