funkifunki
funkifunki

Reputation: 1169

regexp in PySpark

I am trying to reproduce the results of the django ORM query in pyspark:

social_filter = '(facebook|flipboard|linkedin|pinterest|reddit|twitter)'
Collection.objects.filter(social__iregex=social_filter)

My main problem is that it should be case insensitive.

I have tried this:

social_filter = "social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'"
df = sessions.filter(social_filter)

which result in the following error:

Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.22] failure: end of input expected

social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'

And the following expression:

social_filter = "social  ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)"
df = sessions.filter(social_filter)

crashes with this:

Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.17] failure: identifier expected

social  ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)
       ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
    at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

please, help!

Upvotes: 0

Views: 4952

Answers (2)

muma
muma

Reputation: 345

You can also do it now with an UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
import re as re

def filter_fn(s):
     return re.search('(facebook|flipboard|linkedin|pinterest|reddit|twitter)', s, re.IGNORECASE) is not None


filter_udf = F.udf(filter_fn, BooleanType())

sessions_filtered = sessions.filter(filter_udf(sessions['social']))

Upvotes: 2

santon
santon

Reputation: 4625

How about the following:

>>> rdd = sc.parallelize([Row(name='bob', social='TWITter'), 
                          Row(name='steve', social='facebook')])
>>> df = sqlContext.createDataFrame(rdd)
>>> df.where("LOWER(social) LIKE 'twitter'").collect()
[Row(name=u'bob', social=u'TWITter')]

You can do that for all of the social networks you want if you need the actual regular expression. Otherwise, if the match is exact, you can do something like this:

>>> df.where("LOWER(social) IN ('twitter', 'facebook')").collect()
[Row(name=u'bob', social=u'TWITter'), Row(name=u'steve', social=u'facebook')]

Upvotes: 2

Related Questions