Reputation: 1939
I have two dataframes that I would like to join. One dataframe looks like this, where syscode_ntwrk is split by a dash:
spark.createDataFrame(
[
(1, '1234 - ESPN'),
(2, '1234 - ESPN'),
(3, '963 - CNN'),
(4, '963 - CNN'),
],
['id', 'col1']
)
And the other is in this format, where syscode_ntwrk is concatenated together:
spark.createDataFrame(
[
(100, '1234ESPN'),
(297, '1234ESPN'),
(3989, '963CNN'),
(478, '963CNN'),
],
['counts', 'col1']
)
Is there a way to create a new column in the 2nd dataframe that matches the first dataframe's syscode_ntwrk format? syscode will always be a group of digits and ntwrk will always be a group of letters, so is there a regex that inserts a space-dash-space between the two?
Upvotes: 0
Views: 301
Reputation: 42352
You can use regexp_extract to pull out the two groups and concat_ws to join them into the format you want.
import pyspark.sql.functions as F
df = spark.createDataFrame(
[
(100, '1234ESPN'),
(297, '1234ESPN'),
(3989, '963CNN'),
(478, '963CNN'),
],
['counts', 'col1']
)
df.select(
    F.concat_ws(
        ' - ',
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 1),
        F.regexp_extract('col1', r'(\d+)([a-zA-Z]+)', 2)
    ).alias('parsed')
).show()
+-----------+
| parsed|
+-----------+
|1234 - ESPN|
|1234 - ESPN|
| 963 - CNN|
| 963 - CNN|
+-----------+
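As an alternative, the same "digits then letters" split can be done in a single pass with a backreference substitution, which Spark exposes as regexp_replace (with $1/$2 backreferences in the replacement). The sketch below verifies the pattern with plain-Python re first; the equivalent PySpark call is shown in a comment and assumes the same column names as above.

```python
import re

# Pattern: a run of digits followed by a run of letters.
pattern = r'^(\d+)([A-Za-z]+)$'

# Insert " - " between the two groups via backreferences.
for raw in ['1234ESPN', '963CNN']:
    print(re.sub(pattern, r'\1 - \2', raw))
# 1234 - ESPN
# 963 - CNN

# The equivalent in PySpark (Java regex uses $1/$2 in the replacement):
# df = df.withColumn(
#     'parsed', F.regexp_replace('col1', r'^(\d+)([A-Za-z]+)$', '$1 - $2')
# )
```

With the parsed column in place, the second dataframe can be joined to the first on it, e.g. df2.join(df1, df2.parsed == df1.col1).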
Upvotes: 1