Reputation: 341
To keep it simple, I have a df with the following schema:
root
|-- Event_Time: string (nullable = true)
|-- tokens: array (nullable = true)
| |-- element: string (containsNull = true)
Some of the elements of "tokens" contain numbers and special characters, for example:
"431883", "r2b2", "@refe98"
Is there any way I can remove all of those and keep only actual words? I want to do an LDA later and want to clean my data beforehand.
I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly.
Thanks
Edit 2:
from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', ''))
)
This works only if the column is of string type; with the explode method I can explode the array into strings, but then they are not in the same row anymore... Can anyone improve on this?
Upvotes: 2
Views: 6064
Reputation: 16806
The transform()
function was added to the PySpark API in 3.1.0, which helped me accomplish this task a little more easily (the SQL higher-order function it wraps has been available through expr() since Spark 2.4). The example from the question would now look like this:
from pyspark.sql import functions as F

df_2 = df_1.withColumn(
    "tokens",
    F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))")
)
Upvotes: 0
Reputation: 341
The solution I found (as also stated by pault in the comment section):
After explode on tokens, I groupBy and agg with collect_list to get the tokens back in the format I want them.
Here is pault's comment: after the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:
from pyspark.sql.functions import explode, regexp_replace, collect_list

df2 = (df_1.select("Event_Time", explode("tokens").alias("elements"))
    .select("Event_Time", regexp_replace("elements", "<your regex here>", "").alias("elements"))
    .groupBy("Event_Time")
    .agg(collect_list("elements").alias("tokens")))
Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or rdd.
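For completeness, the udf route would look something like this — a rough sketch, assuming the digit-stripping regex from the question (the helper name clean_tokens is made up for illustration):

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical helper: cleans every token while keeping the array in one row
clean_tokens = udf(
    lambda tokens: [re.sub(r"\w*\d\w*", "", t) for t in tokens] if tokens else tokens,
    ArrayType(StringType()),
)

df2 = df_1.withColumn("tokens", clean_tokens("tokens"))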
Upvotes: 1
Reputation: 824
from pyspark.sql.functions import *

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])

# Join the array into one comma-separated string, then strip the unwanted characters
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
# df_1.printSchema()
df_1.show()
+----------+------+------+
| data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
| []| | |
+----------+------+------+
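If the array structure is needed again afterwards, the cleaned string can be split back into an array — a small sketch building on the code above (note that an empty string splits into a single empty element, so the empty row would need extra handling):

df_1 = df_1.withColumn("data_3", split("data_2", ","))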
Upvotes: 1