ichafai

Reputation: 341

PySpark: removing special/numeric strings from an array of strings

To keep it simple, I have a df with the following schema:

root
 |-- Event_Time: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
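
For illustration, here is a small toy DataFrame matching this schema (the rows are made up, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up sample rows matching the schema above
df_1 = spark.createDataFrame(
    [("2019-01-01 00:00:00", ["hello", "431883", "r2b2", "@refe98", "world"])],
    ["Event_Time", "tokens"],
)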

Some of the elements of "tokens" contain numbers and special characters, for example:

 "431883", "r2b2", "@refe98"

Is there any way I can remove all of those and keep only actual words? I want to run LDA later and want to clean my data beforehand. I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly. Thanks.

Edit 2:

from pyspark.sql.functions import explode, regexp_replace

df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', '')))

This works only if the column is of string type, and with the explode method I can turn the array into strings, but then they are no longer in the same row... Can anyone improve on this?

Upvotes: 2

Views: 6064

Answers (3)

Adil B

Reputation: 16806

The transform() function was added in PySpark 3.1.0, which helped me accomplish this task a little more easily. The example in the question would now look like this:

from pyspark.sql import functions as F

# backslashes are doubled so they survive the SQL parser and reach the regex engine as \w*\d\w*
df_2 = df_1.withColumn(
    "tokens",
    F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"),
)

Upvotes: 0

ichafai

Reputation: 341

The solution I found (as also stated by pault in the comment section):

After exploding the tokens, I groupBy and aggregate with collect_list to get the tokens back in the format I want them.

Here is pault's comment: after the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:

from pyspark.sql.functions import explode, regexp_replace, collect_list

df_2 = (df_1
        .select("Event_Time", explode("tokens").alias("elements"))
        .select("Event_Time", regexp_replace("elements", "\\w*\\d\\w*", "").alias("elements"))
        .groupBy("Event_Time")
        .agg(collect_list("elements").alias("tokens")))

Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or RDD.
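
For completeness, a udf-based variant (just a sketch, assuming it is acceptable to keep only purely alphabetic tokens; keep_words is a name made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# keep only purely alphabetic tokens; "431883", "r2b2", "@refe98" are all dropped
keep_words = udf(
    lambda tokens: [t for t in tokens if t.isalpha()] if tokens is not None else None,
    ArrayType(StringType()),
)

df_2 = df_1.withColumn("tokens", keep_words("tokens"))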

Upvotes: 1

Arun Gunalan

Reputation: 824

from pyspark.sql.functions import *

df = spark.createDataFrame([(["@a", "b", "c"],), ([],)], ['data'])
# flatten the array into a comma-separated string
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
# strip the unwanted characters (here ', { and @) from the flattened string
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{@]", ""))
#df_1.printSchema()
df_1.show()

+----------+------+------+
|      data|data_1|data_2|
+----------+------+------+
|[@a, b, c]|@a,b,c| a,b,c|
|        []|      |      |
+----------+------+------+
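
Note that data_2 is a plain comma-separated string. If an array<string> is wanted back (data_3 below is just an illustrative name), it could be split on the comma again, keeping in mind that the empty row then becomes [""] rather than []:

# assumption: the array form is wanted back, so split the cleaned string on the comma
df_1 = df_1.withColumn("data_3", split("data_2", ","))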

Upvotes: 1
