Reputation: 3003
I have created an RDD from an input file, which looks like this:
[u'$, Claw\t\t\t"OnCreativity" (2012) [Himself]']
[u'$, Homo\t\t\tNykytaiteen museo (1986) [Himself] <25>\n\t\t\tSuuri illusioni (1985) [Guests] <22>']
[u'$, Steve\t\tE.R. Sluts (2003) (V) <12>']
It is easy to split each record in this RDD on a single tab character, '\t'
, but what I would like is to split each record on one or more consecutive tabs.
I have tried the usual ways of doing this in Python, i.e. the standard approaches for splitting a string on one or more tabs, but these solutions do not seem to work in PySpark when splitting an RDD record.
Upvotes: 1
Views: 2186
Reputation: 330153
I am not exactly sure what you mean by a set of RDDs but it looks like what you need here is a simple regular expression:
import re

# "\t+" matches a run of one or more consecutive tab characters
pattern = re.compile(r"\t+")

# sc is the existing SparkContext
rdd = sc.parallelize([
    u"foo\t\t\t\tbar",
    u"123\t\t\t456\t\t789\t0"
])

rdd.map(lambda x: pattern.split(x)).collect()
## [[u'foo', u'bar'], [u'123', u'456', u'789', u'0']]
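The same pattern handles the records shown in the question. As a quick local sanity check (plain Python, no Spark required, since the lambda passed to map runs the identical code on each record):

```python
import re

# One or more consecutive tabs, as in the answer above.
pattern = re.compile(r"\t+")

# A sample record from the question, checked locally before mapping it
# over the RDD with rdd.map(lambda x: pattern.split(x)).
record = u'$, Claw\t\t\t"OnCreativity" (2012) [Himself]'
print(pattern.split(record))
# ['$, Claw', '"OnCreativity" (2012) [Himself]']
```

Compiling the pattern once outside the lambda avoids recompiling it for every record, although Python's internal regex cache makes the difference small in practice.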
Upvotes: 2