merkle

Reputation: 1815

convert array type column to lower case in pyspark

I have a dataframe that looks like the one below:

temp = spark.createDataFrame([
    (0, ['This','is','Spark']),
    (1, ['I','wish','Java','could','use','case','classes']),
    (2, ['Data','science','is','cool']),
    (3, ['Machine','Learning'])
], ["id", "words"])
    
+---+------------------------------------------+
|id |words                                     |
+---+------------------------------------------+
|0  |[This, is, Spark]                         |
|1  |[I, wish, Java, could, use, case, classes]|
|2  |[Data, science, is, cool]                 |
|3  |[Machine, Learning]                       |
+---+------------------------------------------+

I want to convert the words column to lower case while keeping the array structure in place.

The schema of the above dataframe is

 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)

I am applying a UDF to convert the words to lower case:

from pyspark.sql import functions as F

def lower(token):
    return list(map(str.lower, token))

lower_udf = F.udf(lower)

temp = temp.withColumn('token', lower_udf("words"))

After performing the above step, my schema changes: the token column becomes a string instead of ArrayType().

 |-- token: string (nullable = true)

I applied F.array() to the column, but it wraps everything in an extra list:

+---+--------------------------------------------+
|id |words                                       |
+---+--------------------------------------------+
|0  |[[This, is, Spark]]                         |
|1  |[[I, wish, Java, could, use, case, classes]]|
|2  |[[Data, science, is, cool]]                 |
|3  |[[Machine, Learning]]                       |
+---+--------------------------------------------+

Desired output:

+---+------------------------------------------+
|id |token                                     |
+---+------------------------------------------+
|0  |[this, is, spark]                         |
|1  |[i, wish, java, could, use, case, classes]|
|2  |[data, science, is, cool]                 |
|3  |[machine, learning]                       |
+---+------------------------------------------+
 |-- token: array (nullable = true)
 |    |-- element: string (containsNull = true)

How do I keep the token column as an array type after converting it to lower case?

Upvotes: 2

Views: 1316

Answers (4)

practicalGuy

Reputation: 1328

You can also use to_json and then lower-case the resulting JSON string. Note that this gives you a string column (JSON text), not an array; a way to convert it back is shown after the output.

from pyspark.sql import functions as func

(spark.sparkContext.parallelize([(0, ['This','is','Spark']),
                                 (1, ['I','wish','Java','could','use','case','classes']),
                                 (2, ['Data','science','is','cool']),
                                 (3, ['Machine','Learning'])
                                ]).toDF(['Num','arry'])
 .withColumn('arry1', func.lower(func.to_json('arry')))
).show(truncate=False)

and it shows as

+---+------------------------------------------+--------------------------------------------------+
|Num|arry                                      |arry1                                             |
+---+------------------------------------------+--------------------------------------------------+
|0  |[This, is, Spark]                         |["this","is","spark"]                             |
|1  |[I, wish, Java, could, use, case, classes]|["i","wish","java","could","use","case","classes"]|
|2  |[Data, science, is, cool]                 |["data","science","is","cool"]                    |
|3  |[Machine, Learning]                       |["machine","learning"]                            |
+---+------------------------------------------+--------------------------------------------------+
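
If you need an actual array column back rather than a JSON string, one option is to parse arry1 with from_json and an explicit schema. A minimal sketch, assuming the dataframe above is bound to df and Spark 2.4+ (where to_json/from_json support plain arrays):

from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, StringType

# parse the lower-cased JSON string back into an array<string> column
df = df.withColumn('arry2', func.from_json('arry1', ArrayType(StringType())))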

Upvotes: 0

FJ_OC

Reputation: 281

Instead of applying F.array() you can use F.split(): https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html

It might be an easier workflow to join the array column into a single string with F.array_join(), run F.lower() on that string, and then re-split it into an array column. This should also make it possible to drop the UDF entirely, which will hopefully improve performance.

So something like

from pyspark.sql import functions as F

temp = spark.createDataFrame([
    (0, ['This','is','Spark']),
    (1, ['I','wish','Java','could','use','case','classes']),
    (2, ['Data','science','is','cool']),
    (3, ['Machine','Learning'])
], ["id", "words"])

# join to one string, lower-case it, then split back into an array
temp = temp.withColumn("words", F.split(F.lower(F.array_join(F.col("words"), ",")), ","))

Although please check that the , delimiter does what you expect, since tokens containing commas or stray whitespace would break the round trip; a safer-delimiter sketch follows.
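
If the tokens might contain commas, the same idea works with a delimiter that is unlikely to appear in the data. A sketch; the unit-separator control character here is an assumption, not anything Spark-specific:

from pyspark.sql import functions as F

SEP = '\u001f'  # unit separator; assumed not to occur in any token

temp = temp.withColumn(
    'words',
    F.split(F.lower(F.array_join(F.col('words'), SEP)), SEP)
)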

Upvotes: 0

Travis

Reputation: 53

If UDF performance is not a concern, the fix is simply to declare the UDF's return type: F.udf defaults to StringType when no return type is given, which is why the schema collapsed to a string.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# the explicit ArrayType(StringType()) return type keeps the column an array
_udf = F.udf(lambda x: [v.lower() for v in x], ArrayType(StringType()))
temp = temp.withColumn('words', _udf(temp['words']))
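
Since the question's schema shows the column as nullable, a null-safe variant of the same UDF may be worth it. A small sketch; the only change is the None guard:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# return None for a null array instead of raising TypeError inside the UDF
_udf = F.udf(lambda x: [v.lower() for v in x] if x is not None else None,
             ArrayType(StringType()))
temp = temp.withColumn('words', _udf(temp['words']))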

Upvotes: 1

samkart

Reputation: 6644

You can use the transform higher-order function.

from pyspark.sql import functions as func

spark.sparkContext.parallelize([(['BLAH', 'Bleh', 'fOO', 'bar'],)]).toDF(['arr_col']). \
    withColumn('arrcol_lower', func.expr('transform(arr_col, x -> lower(x))')). \
    show(truncate=False)

# +----------------------+----------------------+
# |arr_col               |arrcol_lower          |
# +----------------------+----------------------+
# |[BLAH, Bleh, fOO, bar]|[blah, bleh, foo, bar]|
# +----------------------+----------------------+
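
On Spark 3.1+ the same transform is also exposed as a Python-level function, so the expr string can be dropped. A sketch, assuming the dataframe above is bound to df:

from pyspark.sql import functions as func

# func.transform applies lower to each array element without a UDF
df = df.withColumn('arrcol_lower', func.transform('arr_col', func.lower))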

Upvotes: 2
