Reputation: 395
I have a text file and need to keep only the lines that have more than 6 words. I create the RDD:
my_data = sc.textFile("lines.txt")
Then I split each line into words:
line_words = my_data.map(lambda x: x.split(' '))
Then I apply the filter and save the results to a file:
filtered_lines = line_words.filter(lambda x: len(x) > 6)
filtered_lines.saveAsTextFile("out")
From this initial file:
hello world
its fun to have fun but you have to know how
I get:
[u'its', u'fun', u'to', u'have', u'fun', u'but', u'you', u'have', u'to', u'know', u'how']
How do I combine the words back into a line, without the brackets and the u'' prefixes?
I know it would be better to do something like this:
my_data.filter(lambda x: len(x.split(' ')) > 6).saveAsTextFile("out")
But I want to learn how to make the results human-readable.
Upvotes: 1
Views: 681
Reputation: 928
Couldn't you save yourself some processing by skipping the split altogether? You only need the number of words for the filter, and for text separated by single spaces that follows from the number of spaces in the line (words = spaces + 1).
Alternatively, you can do the split and > 6 filter in the same filter step. Since you are not utilizing the individual words, there is no need to keep that data structure around.
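A minimal sketch of the space-counting idea, assuming words are separated by single spaces as in the question. The helper name has_more_than_six_words is made up for illustration, and a plain Python list stands in for the RDD so the snippet runs without Spark:

```python
def has_more_than_six_words(line):
    # With single-space separators, word count = space count + 1,
    # so "more than 6 words" is the same as "at least 6 spaces".
    # This matches len(line.split(' ')) > 6 without building a list.
    return line.count(' ') >= 6


# Stand-in for the contents of lines.txt from the question.
lines = [
    "hello world",
    "its fun to have fun but you have to know how",
]

kept = [line for line in lines if has_more_than_six_words(line)]
print(kept)

# With the RDD from the question, the same predicate could be passed as:
#   my_data.filter(has_more_than_six_words).saveAsTextFile("out")
```

Because the lines are never split, the output stays as plain strings and needs no joining afterwards.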
Upvotes: 1
Reputation: 976
You can use the string.join(array) method to join the elements of a list into a single string, where string is the delimiter:
line = [u'its', u'fun', u'to', u'have', u'fun', u'but', u'you', u'have', u'to', u'know', u'how']
sentence = " ".join(line)
Is this what you mean?
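Applied to the pipeline in the question, the join would go in one extra map step after the filter. The sketch below uses Python's built-in map and filter on a list as a stand-in for the RDD so it runs without Spark; the helper names to_words and to_line are made up for illustration:

```python
def to_words(line):
    # Same split as in the question: one list of words per line.
    return line.split(' ')


def to_line(words):
    # Join the words back into a single space-separated string.
    return ' '.join(words)


# Stand-in for the contents of lines.txt from the question.
lines = [
    "hello world",
    "its fun to have fun but you have to know how",
]

line_words = map(to_words, lines)
filtered_lines = filter(lambda words: len(words) > 6, line_words)
readable = list(map(to_line, filtered_lines))
print(readable)

# The equivalent step on the RDD from the question would be:
#   filtered_lines.map(lambda words: ' '.join(words)).saveAsTextFile("out")
```

Each saved record is then a plain string rather than the printed form of a list, so the brackets and u'' prefixes do not appear in the output files.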
Upvotes: 1