Reputation: 13
I'm trying to break giant text blocks into readable text. I want to do this by inserting a newline character after the third instance of a period followed by a space. (Or, to say that more accurately, I want to replace every third occurrence of ". " with ". \n".)
I ran queries in ChatGPT for about an hour and got a dozen or so wrong answers. It's the "third occurrence" part that's stumping me. I know how to ask for ANY occurrence, but not the nth occurrence.
I generally run regex in SublimeText.
Patterns that failed:
(?:(?:[^.\n]*[.]){2}[^.\n]*[.][ \n])
(?:[^\.\n]*\.[^\.\n]*){2} [^\.\n]*
(?<=\..*\..*\..*)\.
And so on.
This is a sample input text that should match for every third instance of a period followed by a space:
A person under pressure will do things that he or she might not do under normal circumstances. If a person is threatened with losing his home or his family, he may turn to fraud as a means to relieve that financial pressure. Often these individuals have been with the organization for many years and occupy positions of extreme trust. These individuals can be called accidental fraudsters. They are seemingly law abiding, honest people, but when faced with extreme financial pressure, they turn to fraud. This segment will begin by defining some of the basic elements of fraud. We will also discuss the cost of fraud and the importance of understanding how it occurs. We will examine some of the leading theories on why people commit fraud and how that information can be used to help us prevent it.
This sample text should not match (most of the periods have been removed, so only two instances of a period followed by a space remain):
The difference is that criminal cases must meet a higher burden of proof For example, an employee steals $100,000 from his employer by setting up a phony company and submitting false invoices for services that are not performed That conduct is criminal because he's stealing funds through deception, but the company has also been injured the employee's actions and can sue in civil court to get its money back One of the largest causes of fraud involves asset misappropriations Asset misappropriation is simply the theft or misuse of an organization's assets. Common examples include skimming revenues, stealing inventory, obtaining fraudulent payments, and payroll fraud Corruption entails the wrongful or unlawful misuse of influence in a business transaction to procure a personal benefit contrary to an individual's duty to their employer or the rights of another Common examples include accepting kickbacks, demanding extortion or engaging in conflicts of interest. Financial statement fraud involves the intentional misrepresentation of financial or nonfinancial information to mislead others who are relying on it to make economic decisions
Upvotes: 0
Views: 360
Reputation: 110685
I understand Sublime Text uses the Perl Compatible Regular Expressions (PCRE) engine from the Boost library. You can therefore use the following regular expression to replace every third occurrence of ". " with ". \n".
(?:(?:(?!\. ).)*\. ){2}(?:(?!\. ).)*\K\.\s
with the g ("global, do not return after first match") and s ("dot matches newline") flags set.
The regular expression can be broken down as follows.
(?:              # begin a non-capture group
  (?:            # begin a non-capture group
    (?!\.[ ])    # negative lookahead asserts that the following two
                 # characters are not ". "
    .            # match any character
  )*             # end inner non-capture group and execute it zero or more times
  \.[ ]          # match ". "
){2}             # end the outer non-capture group and execute it twice
(?:              # begin a non-capture group
  (?!\.[ ])      # negative lookahead asserts that the following two
                 # characters are not ". "
  .              # match any character
)*               # end non-capture group and execute it zero or more times
\K               # reset the start of the reported match to the current
                 # position, discarding all previously matched characters
\.[ ]            # match the third instance of ". "
Note that in the above I have replaced each space with a character class containing a space ([ ]) merely to make the space visible.
You may also find it helpful to hover the cursor over each part of the regular expression at the demo link to obtain an explanation of its function.
The expression (?:(?!\. ).) matches any single character, provided that character is not a period that is followed by a space (as demanded by the negative lookahead (?!\. )).
This construct is sometimes called the Tempered Greedy Token Solution.
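For anyone trying this outside Sublime Text: as far as I know, Python's built-in re module does not support \K, so the expression above cannot be pasted in unchanged. A rough equivalent, offered only as a sketch, captures everything before the third ". " in a group and writes it back in the replacement. The names TOKEN, PATTERN and break_every_third are made up for illustration.

```python
import re

# Tempered token: zero or more characters, none of which starts the
# two-character sequence ". " (period followed by a space).
TOKEN = r"(?:(?!\. ).)*"

# Group 1 holds two complete ". "-terminated runs plus the text leading up
# to the third ". ". The third ". " itself is matched outside the group.
# re.DOTALL plays the role of the "s" flag (dot matches newline).
PATTERN = re.compile(rf"((?:{TOKEN}\. ){{2}}{TOKEN})\. ", re.DOTALL)


def break_every_third(text: str) -> str:
    """Replace every third ". " with ". " followed by a newline."""
    return PATTERN.sub(lambda m: m.group(1) + ". \n", text)


if __name__ == "__main__":
    print(repr(break_every_third("A. B. C. D. E. F. G. ")))
```

The capture group stands in for \K here: instead of discarding the prefix from the match, the replacement puts it back.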
Alternatively, replace (zero-width) matches of the following regular expression with a newline.
(?:(?:(?!\. ).)*\. ){2}(?:(?!\. ).)*\. \K
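If a script is an option, the same "insert a newline at the right position" idea can be approximated without \K by matching every ". " and counting the matches. This is a different technique from the regular expression above, shown only as a sketch in Python. The function name newline_after_every_nth and its parameter n are hypothetical.

```python
import re
from itertools import count


def newline_after_every_nth(text: str, n: int = 3) -> str:
    """Append a newline after every n-th occurrence of ". " in text."""
    counter = count(1)  # running number of the current ". " match

    def repl(match: re.Match) -> str:
        # Keep the matched ". " and add a newline only on every n-th hit.
        return match.group(0) + ("\n" if next(counter) % n == 0 else "")

    return re.sub(r"\. ", repl, text)


if __name__ == "__main__":
    print(repr(newline_after_every_nth("A. B. C. D. E. F. G. ")))
```

The pattern itself only asks for any occurrence. The "every third" logic lives in the callback, which also makes n easy to change.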
Upvotes: 2
Reputation: 781310
You're missing the space after the .
Find: (?:.*?\.\s){3}
Replace with: $&\n
$& in the replacement text represents everything matched by the regexp.
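For comparison, here is a sketch of the same find/replace in Python, assuming the built-in re module. Python does not use $& in replacement strings; \g<0> refers to the whole match instead. The function name break_with_lazy_match is made up for illustration.

```python
import re


def break_with_lazy_match(text: str) -> str:
    """Append a newline after each run of three period-plus-whitespace hits."""
    # (?:.*?\.\s){3} lazily consumes text up to a period followed by
    # whitespace, three times in a row. \g<0> re-inserts the whole match.
    return re.sub(r"(?:.*?\.\s){3}", r"\g<0>\n", text)


if __name__ == "__main__":
    print(repr(break_with_lazy_match("A. B. C. D. E. F. G. ")))
```

Note that \s also matches tabs and newlines, so a period at the end of a line counts as an occurrence here. Use a literal space in the pattern if that is not wanted.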
Upvotes: 2