Reputation: 1169
I am trying to reproduce the results of the django ORM query in pyspark:
social_filter = '(facebook|flipboard|linkedin|pinterest|reddit|twitter)'
Collection.objects.filter(social__iregex=social_filter)
My main problem is that it should be case insensitive.
I have tried this:
social_filter = "social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'"
df = sessions.filter(social_filter)
which result in the following error:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.22] failure: end of input expected
social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'
And the following expression:
social_filter = "social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)"
df = sessions.filter(social_filter)
crashes with this:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.17] failure: identifier expected
social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
please, help!
Upvotes: 0
Views: 4952
Reputation: 345
You can also do it now with an UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
import re as re
def filter_fn(s):
return re.search('(facebook|flipboard|linkedin|pinterest|reddit|twitter)', s, re.IGNORECASE) is not None
filter_udf = F.udf(filter_fn, BooleanType())
sessions_filtered = sessions.filter(filter_udf(sessions['social']))
Upvotes: 2
Reputation: 4625
How about the following:
>>> rdd = sc.parallelize([Row(name='bob', social='TWITter'),
Row(name='steve', social='facebook')])
>>> df = sqlContext.createDataFrame(rdd)
>>> df.where("LOWER(social) LIKE 'twitter'").collect()
[Row(name=u'bob', social=u'TWITter')]
You can do that for all of the social networks you want if you need the actual regular expression. Otherwise, if the match is exact, you can do something like this:
>>> df.where("LOWER(social) IN ('twitter', 'facebook')").collect()
[Row(name=u'bob', social=u'TWITter'), Row(name=u'steve', social=u'facebook')]
Upvotes: 2