Fill the index number based on the splitted word count in pyspark

Question

The Dataframe (Pyspark)

+-----------+---+-----------+
|       Name|Age|word_count |
+-----------+---+-----------+
|      John | 23|         1 |
|Paul Din O.| 45|         3 |        
|Kelvin Tino| 12|         2 |
+-----------+---+-----------+

Expected Output:

+-----------+---+-----------+
|       Name|Age|word_index |
+-----------+---+-----------+
|      John | 23|         0 |
|      Paul | 45|         0 |
|       Din | 45|         1 |
|        O. | 45|         2 |        
|    Kelvin | 12|         0 |
|      Tino | 12|         1 |
+-----------+---+-----------+

Goals:

Generate the duplicate rows by splitting the name field.
Re-index on each block of the splitted name. (i.e Index will be reset for every name)

Calculated the count by the below code,

def countc(inp='', search_chars='', modifier=None):
"""
Counts the number of characters that appear or do not appear in a list of characters.
"""
# Modifier(s)
if modifier is not None: 
    modifier = modifier.lower()
    if modifier == 'i':
        # Ignore case
        inp = inp.lower()

count = 0
for c in search_chars:
    count += inp.count(c)
return count

udf_countc = F.udf(lambda x, y: countc(x, y), IntegerType())
# spark.udf.register('udf_countc', udf_countc)
df = df.withColumn('word_count', udf_countc(F.col('Name'), F.lit(' ')))

Generated the duplicate rows by the below code.

df.withColumn('DuplicatedRow', F.expr('explode(array_repeat(name, F.col('word_count')))')).show()

How we can place the splitted words into each cell and fill the index on each block of the splitted name field?

mck · Accepted Answer

posexplode does exactly what you want:

import pyspark.sql.functions as F

df2 = df.select(F.posexplode(F.split('Name', ' ')).alias('word_index', 'Name'), 'Age')

df2.show()
+----------+------+---+
|word_index|  Name|Age|
+----------+------+---+
|         0|  John| 23|
|         0|  Paul| 45|
|         1|   Din| 45|
|         2|    O.| 45|
|         0|Kelvin| 12|
|         1|  Tino| 12|
+----------+------+---+

For your edited question and comment,

def countc(inp='', search_chars='', modifier=None):
    """
    Counts the number of characters that appear or do not appear in a list of characters.
    """
    # Modifier(s)
    if modifier is not None: 
        modifier = modifier.lower()
        if modifier == 'i':
            # Ignore case
            inp = inp.lower()    
    count = 0
    for c in search_chars:
        count += inp.count(c)
    return list(range(count+1))

udf_countc = F.udf(lambda x, y: countc(x, y), 'array')
df2 = df.withColumn('word_count', F.explode(udf_countc(F.col('Name'), F.lit(' '))))

df2.show()
+-----------+---+----------+
|       Name|Age|word_count|
+-----------+---+----------+
|       John| 23|         0|
|Paul Din O.| 45|         0|
|Paul Din O.| 45|         1|
|Paul Din O.| 45|         2|
|Kelvin Tino| 12|         0|
|Kelvin Tino| 12|         1|
+-----------+---+----------+

Fill the index number based on the splitted word count in pyspark

Answers (1)

Related Questions