Zoey Malkov
Zoey Malkov

Reputation: 832

Return all regex matches as new rows

I have a reviews table as follows:

r_id my_comment
1 Boxes with the TID 823 cannot exceed 40 kg
2 Parcel with the marking tid 63157 must not make the weight go over 31 k.g
3 Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg
4 TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg
5 Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG

I am trying to match 2 things. The TID and the weight (kg). As you can see there are a 3 things to keep in mind

So I am able to extract the TID and the weight if the comment only has 1 weight and 1 TID. However, if it has multiple, I fail to do so. So I want to separate the multiple into different rows.

This is my desired output

r_id tid weight my_comment
1 823 40 Boxes with the TID 823 cannot exceed 40 kg
2 63157 31 Parcel with the marking tid 63157 must not make the weight go over 31 k.g
3 104124 94.477 Envelopes with TID 104124 and TID 92341 can't excel above 94.477kg
3 92341 Envelopes with TID 104124 and TID 92341 can't excel above 94.477kg
4 38204 45.4 TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg
4 8242602 92 TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg
5 94514 52 Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG
5 51 Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG

SQL to create table/dummy data:

CREATE TABLE reviews(
  r_id number(3) NOT NULL,
  my_comment VARCHAR(255) NOT NULL
);

INSERT INTO reviews (r_id, my_comment) VALUES (1, 'Boxes with the TID 823 cannot exceed 40 kg');
INSERT INTO reviews (r_id, my_comment) VALUES (2, 'Parcel with the marking tid 63157 must not make the weight go over 31 k.g');
INSERT INTO reviews (r_id, my_comment) VALUES (3, 'Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg');
INSERT INTO reviews (r_id, my_comment) VALUES (4, 'TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg');
INSERT INTO reviews (r_id, my_comment) VALUES (5, 'Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG');

In my attempt, I am able to extract the tid and weight, but only the first instance and not able to split it into rows.

SELECT
    r_id,

   REGEXP_SUBSTR (
        REGEXP_SUBSTR (my_comment, '(tid).*?[0-9]+', 1, 1, 'i'),
        '[0-9]+'
    ) as "tid",

    REGEXP_SUBSTR (
        REGEXP_SUBSTR (my_comment, '(cannot exceed|go over| excel above).*?[0-9]+ ?(kg|k.g)', 1, 1, 'i'),
        '[0-9]+'
    ) as "weight"

FROM reviews;

Upvotes: 2

Views: 275

Answers (1)

Littlefoot
Littlefoot

Reputation: 142798

I am able to extract the tid and weight, but only the first instance and not able to split it into rows.

Your query, modified:

  • I didn't do much with what you already wrote as you seem to be satisfied with extracted tid and weight
    • what I did change, was regexp_substr's occurrence parameter (was 1, now it is column_value)
  • in order to get split data, cross join was added which "loops" through my_comment as many times as there's greatest occurrence between tid and kg (in any form)
    • for example, if there are 2 tid and 1 kg, it'll "loop" 2 times
    • it is also used to avoid duplicates you'd get if you used only connect by level clause

You did tag the question as Oracle 10; I don't have it any more, but I know it doesn't support regexp_count function. If that's really the case (you never answered Koen's question), then it won't work and you'll have to calculate number of tid/weight occurrences using some other way. I hope you aren't on 10g, though.

I ran this code in SQL*Plus. BREAK is here just to nicely distinguish r_id and my_comment values, doesn't have any other purpose.

SQL> break on r_id on my_comment
SQL> SELECT r_id,
  2         my_comment,
  3         REGEXP_SUBSTR (REGEXP_SUBSTR (my_comment,
  4                                       '(tid).*?[0-9]+',
  5                                       1,
  6                                       COLUMN_VALUE,
  7                                       'i'),
  8                        '[0-9]+') AS "tid",
  9         REGEXP_SUBSTR (
 10            REGEXP_SUBSTR (
 11               my_comment,
 12               '(cannot exceed|go over| excel above).*?[0-9]+ ?(kg|k.g)',
 13               1,
 14               COLUMN_VALUE,
 15               'i'),
 16            '[0-9]+') AS "weight"
 17    FROM reviews
 18         CROSS JOIN
 19         TABLE (
 20            CAST (
 21               MULTISET (
 22                      SELECT LEVEL
 23                        FROM DUAL
 24                  CONNECT BY LEVEL <= GREATEST (REGEXP_COUNT (my_comment, 'tid'     , 1, 'i'),
 25                                                REGEXP_COUNT (my_comment, '(kg|k.g)', 1, 'i')))
 26                  AS SYS.odcinumberlist));

which results in

 R_ID MY_COMMENT                                                                tid     weight
----- ------------------------------------------------------------------------- ------- -------
    1 Boxes with the TID 823 cannot exceed 40 kg                                823     40
    2 Parcel with the marking tid 63157 must not make the weight go over 31 k.g 63157   31
    3 Envelopes with TID 104124 and TID 92341 cant excel above 94.477kg         104124  94
                                                                                92341
    4 TID38204 cannot go over 45.4 kg and TID 8242602 cannot go over 92kg       38204   45
                                                                                8242602 92
    5 Box with the TID 94514 cannot go over 52kg but also cannot go over 51KG   94514   52
                                                                                        51

8 rows selected.

SQL>

Upvotes: 2

Related Questions