flamewheel
flamewheel

Reputation: 121

Oracle regexp_replace - Adding space to separate sentences

I am working in Oracle to fix some text. The issue is that sentences in my data have words where sentences aren't separated by spaces. For example:

  1. Sentence without space.Between sentences

  2. Sentence with question mark?Second sentence

I've tested the following replace statement in regex101 and it seems to work out there, but I can't pinpoint why it's not working in Oracle:

regexp_replace(review_text, '([^\s\.])([\.!\?]+)([^\s\.\d])', '\1\2 \3')

This should allow me to look for sentence-separating periods/exclamation points/question marks (single or grouped) and add the necessary space between sentences. I realize that there are other ways that sentences can be separated, but what I have above should cover a large majority of the use cases. The \d in the third capture group is to make sure that I'm not accidentally changing numeric values like "4.5" to "4. 5".

Before test group:

Sentence without space.Between sentences
Sentence with space. Between sentences
Sentence with multiple periods...Between sentences
False positive sentence with 4.5 Liters
Sentence with!Exclamation point
Sentence with!Question mark

After changes should look like this:

Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark

Regex101 link: https://regex101.com/r/dC9zT8/1

While all changes work as expected from regex101, my issue is that I'm getting in Oracle is that my third and fourth test cases aren't working as intended. Oracle isn't adding a space after the multiple period (ellipses) group, and regexp_replace is adding a space for "4.5". I'm not sure why this is the case, but perhaps there's some peculiarity about Oracle regexp_replace that I'm missing.

Any and all insight is appreciated. Thanks!

Upvotes: 0

Views: 859

Answers (1)

user5683823
user5683823

Reputation:

This may get you started. This will check for .?! in any combination, followed by zero or more spaces and by an uppercase letter, and it will replace "zero or more spaces" by exactly one space. This will not separate a decimal number; but it will miss sentences that begin with anything other than an uppercase letter. You may start adding conditions - if you run into difficulty please write back and we'll try to help. Referring to other regex dialects may be helpful, but it may not be the fastest way to get your answer.

with
     inputs ( str ) as (
       select 'Sentence without space.Between sentences'           from dual union all
       select 'Sentence with space. Between sentences'             from dual union all
       select 'Sentence with multiple periods...Between sentences' from dual union all
       select 'False positive sentence with 4.5 Liters'            from dual union all
       select 'Sentence with!Exclamation point'                    from dual union all
       select 'Sentence with!Question mark'                        from dual
     )
select regexp_replace(str, '([.!?]+)\s*([A-Z])', '\1 \2') as new_str
from   inputs;

NEW_STR
-------------------------------------------------------
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark

6 rows selected.

Upvotes: 2

Related Questions