regexp in PySpark

Question

I am trying to reproduce the results of the django ORM query in pyspark:

social_filter = '(facebook|flipboard|linkedin|pinterest|reddit|twitter)'
Collection.objects.filter(social__iregex=social_filter)

My main problem is that it should be case insensitive.

I have tried this:

social_filter = "social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'"
df = sessions.filter(social_filter)

which result in the following error:

Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.22] failure: end of input expected

social ILIKE 'facebook' OR social ILIKE 'flipboard' OR social ILIKE 'linkedin' OR social ILIKE 'pinterest' OR social ILIKE 'reddit' OR social ILIKE 'twitter'

And the following expression:

social_filter = "social  ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)"
df = sessions.filter(social_filter)

crashes with this:

Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.17] failure: identifier expected

social  ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)
       ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
    at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

please, help!

santon · Accepted Answer

How about the following:

>>> rdd = sc.parallelize([Row(name='bob', social='TWITter'), 
                          Row(name='steve', social='facebook')])
>>> df = sqlContext.createDataFrame(rdd)
>>> df.where("LOWER(social) LIKE 'twitter'").collect()
[Row(name=u'bob', social=u'TWITter')]

You can do that for all of the social networks you want if you need the actual regular expression. Otherwise, if the match is exact, you can do something like this:

>>> df.where("LOWER(social) IN ('twitter', 'facebook')").collect()
[Row(name=u'bob', social=u'TWITter'), Row(name=u'steve', social=u'facebook')]

regexp in PySpark

Answers (2)

Related Questions