lacerated

Reputation: 395

Filtering in pyspark

My RDD contains TAB-delimited strings. I'm trying to filter it: keep a row only if column 5 contains any of a few strings:

filt_data = raw_data.filter(lambda x: '' if len(x.split('\t')) < 5 else "apple" in x.split('\t')[4] or "pear" in x.split('\t')[4] or "berry" in x.split('\t')[4] or "cherry" in x.split('\t')[4])

I don't think it's a very efficient solution since I'm doing 4 splits of the same row there. Can someone show a more optimal way of doing it?

And what if I have an array of "fruits"? How can I filter my RDD for rows that contain elements from this array? I could do something like x.split('\t')[4] in array, but that will only match if an array element is equal to the column 5 value, whereas I need to check if column 5 contains any of the strings in the array.
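
To make the difference concrete (the column value 'apple pie' below is just a made-up example):

fruits = ['apple', 'pear', 'berry', 'cherry']
col5 = 'apple pie'                      # hypothetical column 5 value

print(col5 in fruits)                   # False - exact membership test
print(any(f in col5 for f in fruits))   # True  - substring containment, which is what I need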

Upvotes: 0

Views: 2129

Answers (1)

Yaron

Reputation: 10450

You can replace the lambda function with a "real" function which does whatever you like, in an efficient way. See below for a prototype of the suggested solution:

def efficient_func(line):
    # split the line only once and reuse the pieces
    fields = line.split('\t')
    if len(fields) < 5:
        return False
    word = fields[4]
    ...

    return ...

filt_data = raw_data.filter(efficient_func)

Regarding the 2nd question - I think that using one "if" statement is better than using several "if" statements, e.g.:

fruits_array = ['apple','pear','berry','cherry']
if word in fruits_array:
  do_something (or return some_value)
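
Note that word in fruits_array is an exact-match test. If you need the "column 5 contains any of the strings" semantics from the question, one way to complete the prototype is with any() over the array; the function name contains_fruit below is only for illustration, and this is just a sketch of one possible implementation:

fruits_array = ['apple', 'pear', 'berry', 'cherry']

def contains_fruit(line):
    # split the line once, then check whether column 5 contains any fruit string
    fields = line.split('\t')
    if len(fields) < 5:
        return False
    word = fields[4]
    return any(fruit in word for fruit in fruits_array)

filt_data = raw_data.filter(contains_fruit)

any() short-circuits on the first match, so each line is split and scanned only once.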

Upvotes: 1
