Reputation: 1815
I have a dataframe that looks like the one below:
temp = spark.createDataFrame([
(0, ['This','is','Spark']),
(1, ['I','wish','Java','could','use','case','classes']),
(2, ['Data','science','is','cool']),
(3, ['Machine','Learning'])
], ["id", "words"])
+---+------------------------------------------+
|id |words                                     |
+---+------------------------------------------+
|0  |[This, is, Spark]                         |
|1  |[I, wish, Java, could, use, case, classes]|
|2  |[Data, science, is, cool]                 |
|3  |[Machine, Learning]                       |
+---+------------------------------------------+
I want to convert the words column above to lower case while keeping the array structure in place.
The schema of the above dataframe is
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
I am applying a UDF to convert the words to lower case:
def lower(token):
    return list(map(str.lower, token))

lower_udf = F.udf(lower)
temp = temp.withColumn('token', lower_udf("words"))
After this step the schema changes: the token column comes back as StringType instead of ArrayType, because F.udf defaults to a StringType return type when none is declared.
|-- token: string (nullable = true)
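(For reference: the Python function itself is fine. A quick check outside Spark, using nothing beyond plain Python, shows it returns a real list; the array structure is lost only in the UDF's default StringType conversion.)

```python
def lower(token):
    # same function as above: lower-case every element, keep a list
    return list(map(str.lower, token))

result = lower(['This', 'is', 'Spark'])
print(result)   # ['this', 'is', 'spark']
```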
I applied F.array() to the column, but it wraps the values in an extra list:
+---+--------------------------------------------+
|id |words                                       |
+---+--------------------------------------------+
|0  |[[This, is, Spark]]                         |
|1  |[[I, wish, Java, could, use, case, classes]]|
|2  |[[Data, science, is, cool]]                 |
|3  |[[Machine, Learning]]                       |
+---+--------------------------------------------+
Desired output:
+---+------------------------------------------+
|id |token                                     |
+---+------------------------------------------+
|0  |[this, is, spark]                         |
|1  |[i, wish, java, could, use, case, classes]|
|2  |[data, science, is, cool]                 |
|3  |[machine, learning]                       |
+---+------------------------------------------+
|-- token: array (nullable = true)
| |-- element: string (containsNull = true)
How do I keep the token column as an array type after converting it to lower case?
Upvotes: 2
Views: 1316
Reputation: 1328
You can also use to_json and lower-case the resulting JSON string:
from pyspark.sql import functions as func

(spark.sparkContext.parallelize([(0, ['This','is','Spark']),
                                 (1, ['I','wish','Java','could','use','case','classes']),
                                 (2, ['Data','science','is','cool']),
                                 (3, ['Machine','Learning'])
                                 ]).toDF(['Num','arry'])
 .withColumn('arry1', func.lower(func.to_json('arry')))
 ).show(truncate=False)
and it shows as
+---+------------------------------------------+--------------------------------------------------+
|Num|arry                                      |arry1                                             |
+---+------------------------------------------+--------------------------------------------------+
|0  |[This, is, Spark]                         |["this","is","spark"]                             |
|1  |[I, wish, Java, could, use, case, classes]|["i","wish","java","could","use","case","classes"]|
|2  |[Data, science, is, cool]                 |["data","science","is","cool"]                    |
|3  |[Machine, Learning]                       |["machine","learning"]                            |
+---+------------------------------------------+--------------------------------------------------+
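Note that lower(to_json(...)) yields a lower-cased JSON string, not an ArrayType column; to get an array back you would still need to parse it, e.g. with from_json and an ArrayType(StringType()) schema (that parsing step is my assumption, it is not in the snippet above). In plain Python the round trip is essentially:

```python
import json

arr = ['I', 'wish', 'Java', 'could', 'use', 'case', 'classes']
as_json = json.dumps(arr)      # roughly what to_json produces
lowered = as_json.lower()      # func.lower applied to the JSON string
back = json.loads(lowered)     # roughly what from_json would recover
print(back)   # ['i', 'wish', 'java', 'could', 'use', 'case', 'classes']
```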
Upvotes: 0
Reputation: 281
Instead of applying F.array() you can use F.split():
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html
An easier workflow might be to join the array column into a single string with F.array_join() (note that F.flatten() only flattens nested arrays; it does not produce a string), run F.lower() on that string, and then re-split it into an array column. This also removes the UDF, which should help performance.
So something like
temp = spark.createDataFrame([
    (0, ['This','is','Spark']),
    (1, ['I','wish','Java','could','use','case','classes']),
    (2, ['Data','science','is','cool']),
    (3, ['Machine','Learning'])
], ["id", "words"])

temp = temp.withColumn("words", F.split(
    F.lower(
        F.array_join(F.col("words"), ",")), ","))
Please check that the "," delimiter does what you expect, though; if the array elements can themselves contain commas or extra whitespace you may need a different separator or some trimming.
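In plain Python terms, the join-into-a-string, lower, re-split workflow behaves like this (a sketch of the logic only, not Spark code):

```python
words = ['I', 'wish', 'Java', 'could', 'use', 'case', 'classes']

joined = ",".join(words)      # array column collapsed into one string
lowered = joined.lower()      # F.lower on the string column
tokens = lowered.split(",")   # F.split re-creates the array
print(tokens)   # ['i', 'wish', 'java', 'could', 'use', 'case', 'classes']
```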
Upvotes: 0
Reputation: 53
If performance is not a concern, just declare the UDF's return type explicitly:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
_udf = F.udf(lambda x: [v.lower() for v in x], ArrayType(StringType()))
temp = temp.withColumn('words', _udf(temp['words']))
Upvotes: 1
Reputation: 6644
You can use the transform function.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([(['BLAH', 'Bleh', 'fOO', 'bar'],)]).toDF(['arr_col']). \
    withColumn('arrcol_lower', func.expr('transform(arr_col, x -> lower(x))')). \
    show(truncate=False)
# +----------------------+----------------------+
# |arr_col |arrcol_lower |
# +----------------------+----------------------+
# |[BLAH, Bleh, fOO, bar]|[blah, bleh, foo, bar]|
# +----------------------+----------------------+
Upvotes: 2