ImNewToThis

Reputation: 153

Reading a string and creating an array of mentioned sub-strings

I'm trying to solve a problem where I have a large string of text (summary) and I'm searching for certain words within it. If any of the words belonging to a category appears in the summary, I want to add that category's tag to an array, as outlined below:

    ground = ['car', 'motorbike']
    air = ['plane']
    colour = ['blue', 'red']

| Summary                | Tag_Array            |
|------------------------|----------------------|
| This is a blue car     | ['ground', 'colour'] |
| This is red motorbike  | ['ground', 'colour'] |
| This is a plane        | ['air']              |

The idea is to read each summary and create an array in the Tag_Array column containing the tags associated with the summary text. A tag can be triggered by any of several keywords; in this case both motorbike and car return the tag ground.

I have this working functionally, but my approach is really awful and very verbose, so I want to work out the most appropriate way to achieve this in PySpark.

    from pyspark.sql import functions as f

    # Split the summary into words, then add one array element per keyword check.
    df = (df
        .withColumn("summary_as_array", f.split('summary', " "))
        .withColumn("tag_array", f.array(
            f.when(f.array_contains('summary_as_array', "car"), "ground").otherwise(""),
            f.when(f.array_contains('summary_as_array', "motorbike"), "ground").otherwise("")
            )
        )
    )

Upvotes: 0

Views: 60

Answers (1)

Suresh

Reputation: 5870

If you can convert the tags into key-value pairs like this,

    tagDict = {'ground':['car', 'motorbike'],'air':['plane'],'colour':['blue','red']}

then we can create a UDF that checks each tag's values against the summary and returns the matching keys, which are the tags. A simple solution:

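    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Assumes an existing SparkSession named `spark`.
    l = [('This is a blue car',),('This is red motorbike',),('This is a plane',)]
    df = spark.createDataFrame(l,['summary'])

    # Declare the return type; without it the UDF defaults to StringType.
    tag_udf = F.udf(lambda x : [k for k,v in tagDict.items() if any(itm in x for itm in v)], ArrayType(StringType()))
    df = df.withColumn('tag_array',tag_udf(df['summary']))
    df.show(truncate=False)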
    +---------------------+----------------+
    |summary              |tag_array       |
    +---------------------+----------------+
    |This is a blue car   |[colour, ground]|
    |This is red motorbike|[colour, ground]|
    |This is a plane      |[air]           |
    +---------------------+----------------+
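
One caveat: `itm in x` is a substring test on the whole summary, so a keyword like 'car' would also match 'scar'. If whole-word matching is what you want, here is a minimal sketch of the lookup in plain Python, so it can be tried without Spark. The name `tags_for` and the lower-casing/tokenising step are my own assumptions, not something from the question. Also note that the order of the returned tags follows the dictionary's iteration order.

```python
import re

# Tag dictionary from the answer above.
tagDict = {'ground': ['car', 'motorbike'], 'air': ['plane'], 'colour': ['blue', 'red']}

def tags_for(summary, tags=tagDict):
    """Return the tags whose keywords appear as whole words in `summary`.

    Hypothetical helper: lower-cases the text and splits on non-word
    characters, so 'Car.' still matches 'car' but 'scar' does not.
    """
    words = set(re.findall(r"\w+", summary.lower()))
    # Keep a tag when at least one of its keywords is among the words.
    return [tag for tag, keywords in tags.items() if words.intersection(keywords)]

for text in ['This is a blue car', 'This is red motorbike', 'This is a plane']:
    print(text, tags_for(text))
```

If this behaves the way you need, it should be usable from Spark by wrapping it as `F.udf(tags_for, ArrayType(StringType()))` in place of the lambda.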

Hope this helps.

Upvotes: 1
