Introduce a new column in data frame with the value based on condition in PySpark

Question

I have JSON data like the following.

    {"images": [
        {"alt": null, "src": "link_1"},
        {"alt": null, "src": "link_2"},
        {"alt": "Apple", "src": "link_3"},
        {"alt": null, "src": "link_4"}
    ]}
    {"images": [
        {"alt": "Orange", "src": "link_1"},
        {"alt": null, "src": "link_2"}
    ]}

I need to add a new column to the data frame whose value is taken from src according to the conditions below.

Never assign the value in the first position (example: link_1).
If an alt is not NULL, assign the corresponding src to the new column. If more than one alt has a value, pick the first one, excluding position one.
If every alt (ignoring position one) is NULL, assign the src in the second position to the new column.

Note: images always contains more than one element.

For the above example, the expected output is

    +--------------------+
    |      new column    |
    +--------------------+
    |link_3              |
    |link_2              |
    +--------------------+

Can anyone help me get the expected output? Thanks in advance.

T.SURESH ARUNACHALAM · Accepted Answer

I solved this today.

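    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def extractSecondaryImageUrl(images):
        # Missing or empty array: nothing to pick.
        if not images:
            return ''
        # Only one image: fall back to its src.
        if len(images) == 1:
            return images[0]['src']
        # Skip the first image and return the first src whose alt is not null.
        for image in images[1:]:
            if image['alt'] is not None:
                return image['src']
        # No alt after position one has a value: use the second image's src.
        return images[1]['src']

    # Wrap the function as a UDF that returns a string.
    extractURL = udf(extractSecondaryImageUrl, StringType())

    productsDF = productsDF.select("*", extractURL("images").alias('new_column'))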
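
The selection rule can be checked without a Spark session. The sketch below is plain Python and assumes each images value arrives as a list of dict-like rows with alt and src keys; `pick_secondary_src` and `rows` are hypothetical names used only for this illustration, and the data is the sample from the question.

```python
def pick_secondary_src(images):
    """Hypothetical helper: apply the question's rules to one images list."""
    if not images:
        return ''
    if len(images) == 1:
        return images[0]['src']
    rest = images[1:]  # the first image is never a candidate
    # First src after position one whose alt is set; otherwise the second image's src.
    return next((img['src'] for img in rest if img['alt'] is not None), rest[0]['src'])

# Sample data copied from the question, one list per record.
rows = [
    [{"alt": None, "src": "link_1"}, {"alt": None, "src": "link_2"},
     {"alt": "Apple", "src": "link_3"}, {"alt": None, "src": "link_4"}],
    [{"alt": "Orange", "src": "link_1"}, {"alt": None, "src": "link_2"}],
]

results = [pick_secondary_src(r) for r in rows]
print(results)  # ['link_3', 'link_2']
```

The printed values match the expected output in the question: link_3 for the first record and link_2 for the second.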
