sanjayr
sanjayr

Reputation: 1939

Create Column by parsing string from another column

I have two dataframes that I would like to join.

One dataframe is like this, where syscode_ntwrk is split up by dash.

spark.createDataFrame(
    [
        (1, '1234 - ESPN'), 
        (2, '1234 - ESPN'),
        (3, '963 - CNN'), 
        (4, '963 - CNN'),
    ],
    ['id', 'col1'] 
)

And the other is in this format, where syscode_ntwrk is concatenated together.

spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

Is there a way in the 2nd dataframe to create a new column to match the first dataframe for syscode_ntwrk? Syscode will always be a group of numbers, and ntwrk will always be a group of letters, so is there a regex to add a space dash space in between the two?

Upvotes: 0

Views: 301

Answers (1)

mck
mck

Reputation: 42352

You can use regexp_extract to extract the groups and use concat_ws to convert the groups into what you want.

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (100, '1234ESPN'), 
        (297, '1234ESPN'),
        (3989, '963CNN'), 
        (478, '963CNN'),
    ],
    ['counts', 'col1'] 
)

df.select(
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', '(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', '(\d+)([a-zA-Z]+)', 2)
    ).alias('parsed')
).show()

+-----------+
|     parsed|
+-----------+
|1234 - ESPN|
|1234 - ESPN|
|  963 - CNN|
|  963 - CNN|
+-----------+

Upvotes: 1

Related Questions