Reputation: 395
The Dataframe (Pyspark)
+-----------+---+-----------+
| Name|Age|word_count |
+-----------+---+-----------+
| John | 23| 1 |
|Paul Din O.| 45| 3 |
|Kelvin Tino| 12| 2 |
+-----------+---+-----------+
Expected Output:
+-----------+---+-----------+
| Name|Age|word_index |
+-----------+---+-----------+
| John | 23| 0 |
| Paul | 45| 0 |
| Din | 45| 1 |
| O. | 45| 2 |
| Kelvin | 12| 0 |
| Tino | 12| 1 |
+-----------+---+-----------+
Goals:
name
field.name
. (i.e Index will be reset for every name
)Calculated the count by the below code,
def countc(inp='', search_chars='', modifier=None):
"""
Counts the number of characters that appear or do not appear in a list of characters.
"""
# Modifier(s)
if modifier is not None:
modifier = modifier.lower()
if modifier == 'i':
# Ignore case
inp = inp.lower()
count = 0
for c in search_chars:
count += inp.count(c)
return count
udf_countc = F.udf(lambda x, y: countc(x, y), IntegerType())
# spark.udf.register('udf_countc', udf_countc)
df = df.withColumn('word_count', udf_countc(F.col('Name'), F.lit(' ')))
Generated the duplicate rows by the below code.
df.withColumn('DuplicatedRow', F.expr('explode(array_repeat(name, F.col('word_count')))')).show()
How we can place the splitted words into each cell and fill the index
on each block of the splitted name
field?
Upvotes: 1
Views: 109
Reputation: 42392
posexplode
does exactly what you want:
import pyspark.sql.functions as F
df2 = df.select(F.posexplode(F.split('Name', ' ')).alias('word_index', 'Name'), 'Age')
df2.show()
+----------+------+---+
|word_index| Name|Age|
+----------+------+---+
| 0| John| 23|
| 0| Paul| 45|
| 1| Din| 45|
| 2| O.| 45|
| 0|Kelvin| 12|
| 1| Tino| 12|
+----------+------+---+
For your edited question and comment,
def countc(inp='', search_chars='', modifier=None):
"""
Counts the number of characters that appear or do not appear in a list of characters.
"""
# Modifier(s)
if modifier is not None:
modifier = modifier.lower()
if modifier == 'i':
# Ignore case
inp = inp.lower()
count = 0
for c in search_chars:
count += inp.count(c)
return list(range(count+1))
udf_countc = F.udf(lambda x, y: countc(x, y), 'array<int>')
df2 = df.withColumn('word_count', F.explode(udf_countc(F.col('Name'), F.lit(' '))))
df2.show()
+-----------+---+----------+
| Name|Age|word_count|
+-----------+---+----------+
| John| 23| 0|
|Paul Din O.| 45| 0|
|Paul Din O.| 45| 1|
|Paul Din O.| 45| 2|
|Kelvin Tino| 12| 0|
|Kelvin Tino| 12| 1|
+-----------+---+----------+
Upvotes: 2