Reputation: 5477
I have sentences that quote text inside them, like:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?
I am trying to mask the quoted parts with a regex, but the result is not accurate. For instance, for the last sentence:
import re
txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))
The output is:
Reread these sentences: "<quote>" mean?
Instead, it should be:
Reread these sentences: "<quote>" What does the word "courtship" mean?
Since I have more than 10k instances, it is really hard to find a single regex pattern that works for all cases.
My question is: is there a library (maybe one based on a neural network?) or another approach that solves this problem?
Upvotes: 1
Views: 77
Reputation: 422
Another approach is to skip regex altogether and use shlex, which the Python documentation describes as follows:
The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages, (for example, in run control files for Python applications) or for parsing quoted strings.
shlex.split considers quotes when splitting into words, and passing posix=False keeps the quotes in the results. With its output, you can build a string like the one you describe.
import shlex

lines = [
    'Why did the author use three sentences in a row that start with the words, "it spun"?',
    'Why did the queen most likely say “I would have tea instead.”',
    'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
    'Why were "the lights of his town growing smaller below them"?',
    'What is a fdsfdsf for the word "adjust"?',
    'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
    'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]

for line in lines:
    # posix=False keeps the surrounding quotes on each quoted token
    print(
        " ".join(
            word if word[0] != '"' and word[-1] != '"' else '"<quote>"'
            for word in shlex.split(line, posix=False)
        )
    )
output:
Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
shlex does not interpret curly quotes as quotes (e.g. line 2), so if you have them, you should .replace() them before feeding each line to it.

To mask only the first quoted span on each line, count the replacements:

for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            # Mask only the first quoted token; keep later ones as they are
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))
output:
Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
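The curly-quote note above can be sketched as follows. This is only an illustration, not part of the original answer: the helper name mask_quotes and the translation table CURLY_TO_STRAIGHT are made up here, and it assumes the only curly quotes in your data are the left and right double quotes.

```python
import shlex

# Assumption: only the curly double quotes U+201C and U+201D need normalizing.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def mask_quotes(line: str) -> str:
    """Mask every double-quoted span in a line (hypothetical helper)."""
    # Turn curly quotes into straight ones so shlex sees them as quotes.
    normalized = line.translate(CURLY_TO_STRAIGHT)
    # posix=False keeps the surrounding quotes on each quoted token.
    words = shlex.split(normalized, posix=False)
    return " ".join(
        '"<quote>"' if w[0] == '"' and w[-1] == '"' else w
        for w in words
    )


print(mask_quotes('Why did the queen most likely say “I would have tea instead.”'))
```

Keep in mind that shlex.split raises ValueError when a quote is never closed, so lines with unbalanced quotes would need to be caught or cleaned up first.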
Upvotes: 0
Reputation: 18631
For these examples, use:
import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)
See Python proof. For each type of quote, use a separate substitution; this makes the behavior easier to control.
Results:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
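If you want to reuse this on many lines, the two substitutions can be wrapped in one function. The sketch below is just one way to do it: the name mask_long_quotes and the min_len parameter are hypothetical, and the threshold of 20 characters is taken from the question. Unlike the greedy .{20,} in the question, the negated character classes cannot run across a closing quote, so each quoted span is matched on its own.

```python
import re


def mask_long_quotes(text: str, min_len: int = 20) -> str:
    """Replace quoted spans of at least min_len characters with <quote>."""

    def repl(match: re.Match) -> str:
        # Keep short quotes (single words or phrases) untouched.
        if len(match.group(1)) >= min_len:
            return "<quote>"
        return match.group(0)

    # Straight double quotes: [^"]* stops at the next quote character.
    text = re.sub(r'"([^"]*)"', repl, text)
    # Curly double quotes, handled in a separate substitution.
    text = re.sub(r'“([^“”]*)”', repl, text)
    return text


sample = (
    'Reread these sentences: "This was his courtship, and it lasted all '
    'through the summer." What does the word "courtship" mean?'
)
print(mask_long_quotes(sample))
```

Lowering min_len masks shorter quotes as well, so you can tune it against a sample of your 10k sentences.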
Upvotes: 1