Reputation: 395
My RDD contains tab-delimited strings. I'm trying to filter it so that only rows whose column 5 contains one of a few strings are kept:
filt_data = raw_data.filter(lambda x: '' if len(x.split('\t')) < 5 else "apple" in x.split('\t')[4] or "pear" in x.split('\t')[4] or "berry" in x.split('\t')[4] or "cherry" in x.split('\t')[4])
I don't think it's a very efficient solution, since I'm doing 4 splits of the same row. Can someone show a more optimal way of doing it?
And what if I have an array of "fruits"? How can I filter my RDD to keep the rows that contain elements from this array?
I could do something like x.split('\t')[4] in array
but that only matches when an array element is equal to the column 5 item, and I need to check whether column 5 contains any of the strings in the array.
Upvotes: 0
Views: 2129
Reputation: 10450
You can replace the lambda with a regular named function that does whatever you need efficiently. Here is a prototype of the suggested solution:
def efficient_func(line):
    fields = line.split('\t')  # split once and reuse the result
    if len(fields) < 5:
        return False  # drop rows with fewer than 5 columns
    word = fields[4]
    ...
    return ...
filt_data = raw_data.filter(efficient_func)
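One way the prototype above could be filled in, as a minimal sketch rather than the only option: split the row once, then run the four substring checks from the question on the cached column. The function name `has_fruit_in_col5` and the sample rows are made up for illustration, and the check is exercised on a plain Python list so it can be tried without a Spark context.

```python
def has_fruit_in_col5(line):
    """Return True if the 5th tab-separated column contains one of the fruit names."""
    fields = line.split('\t')  # split once instead of four times
    if len(fields) < 5:
        return False  # rows with fewer than 5 columns are dropped
    word = fields[4]
    return "apple" in word or "pear" in word or "berry" in word or "cherry" in word


# Hypothetical sample rows, used only to try the predicate locally.
sample_rows = [
    "a\tb\tc\td\tgreen apple\tf",
    "a\tb\tc",
    "a\tb\tc\td\tbanana",
    "1\t2\t3\t4\tblueberry",
]
kept = [row for row in sample_rows if has_fruit_in_col5(row)]
```

With an RDD, the same predicate would be passed as raw_data.filter(has_fruit_in_col5).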
Regarding the 2nd question: I think a single "if" statement over a list is better than several separate "if" statements, e.g.
fruits_array = ['apple', 'pear', 'berry', 'cherry']
if word in fruits_array:  # true only when word equals one of the list items
    do_something()  # or return some_value
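Since the question asks whether column 5 contains any of the strings (not whether it equals one of them), a substring test over the list may be closer to what is needed. The sketch below uses Python's built-in any() for that; the function name `column5_contains_any` and its `needles` parameter are hypothetical names chosen for this example.

```python
fruits_array = ['apple', 'pear', 'berry', 'cherry']


def column5_contains_any(line, needles=fruits_array):
    """Return True if the 5th tab-separated column contains any string in needles."""
    fields = line.split('\t')  # split once per row
    if len(fields) < 5:
        return False  # too few columns, drop the row
    word = fields[4]
    # any() stops at the first needle found as a substring of word
    return any(needle in word for needle in needles)


# Local check on hypothetical rows, no Spark context required.
rows = ["x\tx\tx\tx\tpineapple juice", "x\tx\tx\tx\tapple", "x\tx\tx\tx\tmelon"]
matches = [row for row in rows if column5_contains_any(row)]
```

On the RDD this would be used as raw_data.filter(column5_contains_any). Note that "pineapple juice" matches because "apple" is a substring of it, which is the containment behaviour the question asks for.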
Upvotes: 1