user2543622

Reputation: 6756

pyspark.sql data.frame understanding functions

I am taking a MOOC.

One assignment requires converting a column to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and couldn't figure out what was wrong with my syntax. Could someone explain how I could have worked out, from the online documentation and the function definitions, that my syntax was wrong?

I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.

from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    sentence=lower(column)

    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
.select(removePunctuation(col('sentence')))
.show(truncate=False))

Upvotes: 2

Views: 2086

Answers (3)

Abdalrahman

Reputation: 466

   return trim(lower(regexp_replace(column, r"\p{Punct}", ""))).alias('sentence')

Upvotes: 0

Leonel Atencio

Reputation: 474

This is how I managed to do it:

lowered = lower(column)
np_lowered = regexp_replace(lowered, r'[^\w\s]', '')
trimmed_np_lowered = trim(np_lowered)

return trimmed_np_lowered
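
One thing worth checking against the assignment's sample sentences: \w also matches the underscore, so a pattern like [^\w\s] leaves "under_score" untouched. Below is a rough sketch that tries the pattern on the three sentences from the question using Python's re module as a stand-in for Spark's regex engine (an assumption; for these ASCII samples the two should agree). The helper name clean and the alternative pattern [^a-zA-Z0-9\s] are mine, not part of the course material.

```python
import re

# The three sample sentences from the question.
SAMPLES = ['Hi, you!', ' No under_score!', ' *      Remove punctuation then spaces  * ']


def clean(text, pattern):
    """Lower-case, delete everything matching pattern, then strip both ends."""
    return re.sub(pattern, '', text.lower()).strip()


for sample in SAMPLES:
    # Left: \w-based pattern (keeps '_'). Right: explicit letters and digits only.
    print(repr(clean(sample, r'[^\w\s]')), repr(clean(sample, r'[^a-zA-Z0-9\s]')))
```

If the underscore has to go as well, the second pattern could be passed to regexp_replace in place of the first.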

Upvotes: 0

Juan Carlos

Reputation: 45

You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().

And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:

a_string = "StringToConvert"
a_string.lower()                     # "stringtoconvert"

However, in the Spark example you provided, your function removePunctuation is NOT working with a single string; it is working with a Column. A Column is a different object from a string, which is why you should use a function that works with a Column.

Specifically, you are working with this pyspark sql method. The next time you are in doubt about which method to use, double-check the datatype of your objects. Also, if you check the list of imports, you will see that lower is imported from pyspark.sql.functions.
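
A minimal sketch of the "check the datatype" advice: look at the class of the object and see whether it defines the method you want to call. The helper name has_method is made up for illustration. It looks the name up on the class rather than the instance because, as far as I know, a Column treats unknown attribute names as field lookups, so hasattr on the instance can be misleading. The PySpark part assumes pyspark is installed and can start a local session, so it is wrapped in a function and not run automatically.

```python
def has_method(obj, name):
    """Return True if the class of obj defines a callable attribute called name."""
    # Look the name up on the class rather than the instance, so that a custom
    # __getattr__ on the instance cannot make a missing method look present.
    return callable(getattr(type(obj), name, None))


a_string = "StringToConvert"
print(type(a_string).__name__, has_method(a_string, "lower"))  # str True


def inspect_column():
    """Same check for a Column; assumes pyspark is installed and can run locally."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lower

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    column = col("sentence")
    # Report the type of the object and whether its class has a lower method.
    print(type(column).__name__, has_method(column, "lower"))
    # The module-level function takes a Column; report the type of its result.
    print(type(lower(column)).__name__)
    spark.stop()
```

Calling inspect_column() on a machine with Spark available should show that the object is a Column whose class has no lower method, which is the hint to look for a function in pyspark.sql.functions instead.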

Upvotes: 2
