Abhigyan Sarma
Abhigyan Sarma

Reputation: 35

Clean string in postgresql

I have a column in a table that contains data about any updates related to changes about a company in the below format -

#=============#==============#================#
| Company ID  |  updated_at  |   updates      |
#=============#==============#================#
| 101         | 2020-11-01   | name:          |
|             |              | -ABC           |
|             |              | -XYZ           |
|             |              | url:           |
|             |              | -www.abc.com   |
|             |              | -www.xyz.com   |
+-------------+--------------+----------------+
| 109         | 2020-10-20   | rating:        |
|             |              | -4.5           |
|             |              | -4.0           |
+-------------+--------------+----------------+

As you can see above, the column updates contains strings that include newlines and describe one or multiple updates. In the above example this means that for company ID 101, the name changed from ABC to XYZ and the url changed from www.abc.com to www.xyz.com. For company ID 109, only the rating changed from 4.5 to 4.0.

However I would like to divide the updates column into 3 columns - one should contain what was changed (url, name etc.), second should have the old value and the 3rd column should have the new value. Something like this -

#============#============#==============#================#
| Company ID |   Field    |  Old Value   |   New Value    |
#============#============#==============#================#
| 101        |   name     | ABC          | XYZ            |
+------------+------------+--------------+----------------+
| 101        |   url      | www.abc.com  | www.xyz.com    |
+------------+------------+--------------+----------------+
| 109        |   rating   | 4.5          | 4.0            |
+------------+------------+--------------+----------------+

I am doing this in Postgres and know how to extract substrings based on characters but this looks a bit complicated to me since I need to extract multiple substrings from the same column for each row. Any help would be appreciated. Thanks!

Upvotes: 1

Views: 747

Answers (1)

buddemat
buddemat

Reputation: 5311

At first, you can use regexp_split_into_table and a regex with a positive lookahead to get a version of your table in which each of the rows contains exactly one update:

select companyID, 
       updated_at, 
       regexp_split_to_table(updates, '\n(?=\y.+:)') as updates 
  from old;

This will split the column updates at any newline (\n) that is followed by a single word and a colon (\y.+:).

#=============#==============#================#
| companyID   |  updated_at  |   updates      |
#=============#==============#================#
| 101         | 2020-11-01   | name:          |
|             |              | -ABC           |
|             |              | -XYZ           |
+-------------+--------------+----------------+
| 101         | 2020-11-01   | url:           |
|             |              | -www.abc.com   |
|             |              | -www.xyz.com   |
+-------------+--------------+----------------+
| 109         | 2020-10-20   | rating:        |
|             |              | -4.5           |
|             |              | -4.0           |
+-------------+--------------+----------------+

From this, you can more easily build your desired table. To do this, you can use e.g. split_part to split the update string into the three parts you want.

Putting this together with the first part gets you the full query:

select companyID, 
       updated_at, 
       split_part(updates, E':', 1) as field, 
       split_part(updates, E'\n-', 2) as old_value, 
       split_part(updates, E'\n-', 3) as new_value  
  from (select companyID, 
               updated_at, 
               regexp_split_to_table(updates, '\n(?=\y.+:)') as updates 
          from old
       )
;

Here is a db<>fiddle example.

More details / additional info:

Upvotes: 2

Related Questions