Rakesh Adhikesavan

Reputation: 12836

PySpark: Filter a DataFrame using condition

I have the following sample DataFrame:

l = [('Alice went to wonderland',), ('qwertyuiopqwert some text',), ('hello world',), ('ThisGetsFilteredToo',)]
df = spark.createDataFrame(l, ['col1'])


+-------------------------+
|                     col1|
+-------------------------+
| Alice went to wonderland|
|qwertyuiopqwert some text|
|              hello world|
|      ThisGetsFilteredToo|
+-------------------------+

Given this DataFrame, I want to filter out the rows that contain even one word with a length > 15 characters. In this example, row 2 has the word 'qwertyuiopqwert', which has a length > 15, so it should get dropped. Similarly, row 4 should be dropped too.
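For reference, a minimal sketch of the intended filter using built-in Spark SQL functions rather than a UDF (this assumes Spark 2.4+, which provides the exists higher-order function, and that the column is named col1; the local name kept is just for illustration):

from pyspark.sql import functions as F

# Keep only rows where no space-separated word is longer than 15 characters.
kept = df.filter(~F.expr("exists(split(col1, ' '), w -> length(w) > 15)"))
kept.show(truncate=False)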

Upvotes: 0

Views: 3937

Answers (2)

markwatsonatx

Reputation: 3501

While the previous answer seems correct, I think you can do this with a simple user-defined function. Create a function that splits the string and returns False if any word has a length > 15:

def no_long_words(s):
    # Return False if any whitespace-separated word is longer than 15 characters.
    for word in s.split():
        if len(word) > 15:
            return False
    return True

Create the udf:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

no_long_words_udf = udf(no_long_words, BooleanType())

Run a filter on the dataframe using the udf:

df2 = df.filter(no_long_words_udf('col1'))
df2.show()

+--------------------+
|                col1|
+--------------------+
|Alice went to won...|
|qwertyuiopqwert s...|
|         hello world|
+--------------------+

Note: 'qwertyuiopqwert' is actually exactly 15 characters long, not more than 15, so that row passes the filter and is included in the results.
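If the intent is to drop that row as well, the comparison can be tightened; a small sketch of that variant (the name no_long_words_strict is just for illustration):

# 'qwertyuiopqwert' is exactly 15 characters, so `len(word) > 15` is False and the row is kept.
print(len('qwertyuiopqwert'))  # 15

# Variant that also drops rows containing words of exactly 15 characters:
def no_long_words_strict(s):
    return all(len(word) < 15 for word in s.split())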

Upvotes: 2

StackPointer

Reputation: 529

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = ['athshgthsc asl', 'sdf sdfdsadf sdf', 'arasdfa sdf', 'aa bb', 'aaa bbb ccc', 'dd aa bbb']
df = spark.createDataFrame(data, StringType())

def getLengths(lst):
    # Return the length of each word in the list.
    tempLst = []
    for ele in lst:
        tempLst.append(len(ele))
    return tempLst

getList = udf(lambda data: data.split(), ArrayType(StringType()))  # line -> list of words
getListLen = udf(getLengths, ArrayType(IntegerType()))             # words -> word lengths
getMaxLen = udf(lambda data: max(data), IntegerType())             # lengths -> longest word length

df = (df.withColumn('splitWords', getList(df.value))
        .withColumn('lengthList', getListLen(col('splitWords')))
        .withColumn('maxLen', getMaxLen('lengthList')))

df.select('value').show()                        # all rows, value column only
df.show()                                        # with the derived columns
df.filter(df.maxLen < 5).select('value').show()  # rows whose longest word has fewer than 5 characters




+----------------+
|           value|
+----------------+
|  athshgthsc asl|
|sdf sdfdsadf sdf|
|     arasdfa sdf|
|           aa bb|
|     aaa bbb ccc|
|       dd aa bbb|
+----------------+

+----------------+--------------------+----------+------+
|           value|          splitWords|lengthList|maxLen|
+----------------+--------------------+----------+------+
|  athshgthsc asl|   [athshgthsc, asl]|   [10, 3]|    10|
|sdf sdfdsadf sdf|[sdf, sdfdsadf, sdf]| [3, 8, 3]|     8|
|     arasdfa sdf|      [arasdfa, sdf]|    [7, 3]|     7|
|           aa bb|            [aa, bb]|    [2, 2]|     2|
|     aaa bbb ccc|     [aaa, bbb, ccc]| [3, 3, 3]|     3|
|       dd aa bbb|       [dd, aa, bbb]| [2, 2, 3]|     3|
+----------------+--------------------+----------+------+

+-----------+
|      value|
+-----------+
|      aa bb|
|aaa bbb ccc|
|  dd aa bbb|
+-----------+

This can be modified to filter on word length > 15 instead. More pre-processing can also be performed before splitting the dataset. Here I have filtered out rows containing any word of 5 or more characters (maxLen < 5).
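To match the question's threshold, only the final filter needs a different cutoff; a minimal sketch, reusing the maxLen column built above:

# Keep rows whose longest word is at most 15 characters,
# i.e. drop rows containing any word longer than 15 characters.
df.filter(df.maxLen <= 15).select('value').show()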

Upvotes: 0
