KPC2
KPC2

Reputation: 25

Pyspark - Replace portion of a string with different characters (uneven character count)

I'm trying to replace a portion of a string with a different and shorter string which has : and +. Even though the values under the Start column is time, it is not a timestamp and instead it is recognised as a string.

I've tried using regexp_replace but currently don't know how to specify the last 8 characters in the string in the 'Start' column that needs to be replaced or specify the string that I want to replace with the new one.

df = df.withColumn('replaced', regexp_replace('Start', ':00+10:00', '00Z' ))

For example: I would start with the dataframe with columns:

| Data | Qualifier | Description | Start                    |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00+10:00 |
|24    | 32        | banana      |2021-03-04T22:30:00+10:00 |
|24    | 32        | orange      |2021-03-04T11:58:00+10:00 |

And would like to have the Start column replaced with a new Column named 'Replaced' or retain 'Start' column header.

Expected Output:

| Data | Qualifier | Description | Replaced                 |
|:----:|:---------:|:-----------:|:------------------------:|
|12    | 54        | apple       |2021-03-03T02:00:00Z      |
|24    | 32        | banana      |2021-03-04T22:30:00Z      |
|24    | 32        | orange      |2021-03-04T11:58:00Z      |

Upvotes: 1

Views: 2316

Answers (2)

KPC2
KPC2

Reputation: 25

I have also been able to solve this without using regexp_replace and instead using expr and format_string, since the input was a strint and not a timestamp.

df= df.withColumn( "Start", expr("substring(Start, 1, length(Start)-8)") )

df= df.withColumn( "Start", format_string("%s00Z", "Start") )

Upvotes: 1

Mohana B C
Mohana B C

Reputation: 5487

You can convert the column type to timestamp using to_timestamp() function by providing input timestamp pattern.

If you really want to use regexp_replace then in your regex pattern you have to escape + using \

>>> df = spark.createDataFrame(['2021-03-03T02:00:00+10:00'], StringType())
>>> df.withColumn('replaced', regexp_replace("value", "\+\d+:\d+", "Z")).show(truncate=False)

+-------------------------+--------------------+
|value                    |replaced            |
+-------------------------+--------------------+
|2021-03-03T02:00:00+10:00|2021-03-03T02:00:00Z|
+-------------------------+--------------------+

Upvotes: 1

Related Questions