Minions

Reputation: 5477

Detection of quoted text in sentences

I have sentences that quote text inside them, like:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?

I am trying to mask the quoted parts with a regex, but the result is not accurate. For instance, for the last sentence:

import re

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))

The output is:

Reread these sentences: "<quote>" mean?

Instead, it should be:

Reread these sentences: "<quote>" What does the word "courtship" mean?

Since I have more than 10k instances, it's hard to find a single regex pattern that works for all cases.

Is there a library (maybe one based on a neural network?) or another approach that solves this problem?

Upvotes: 1

Views: 77

Answers (2)

Edoardo Facchinelli

Reputation: 422

Another approach is to skip regex altogether and use shlex from the standard library. From its documentation:

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages, (for example, in run control files for Python applications) or for parsing quoted strings.

shlex.split respects quotes when splitting a string into words, and passing posix=False keeps the quote characters in the resulting tokens. From that output you can build a string like the one you describe.
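As a small illustration of the posix flag (the sample string and the variable names are made up for this demo, not taken from the question):

```python
import shlex

sample = 'repeat the phrase "he waited" so many times'

# posix=False: a quoted run stays one token and keeps its quote marks.
kept = shlex.split(sample, posix=False)
print(kept)
# ['repeat', 'the', 'phrase', '"he waited"', 'so', 'many', 'times']

# posix=True (the default): same tokens, but the quote marks are stripped.
stripped = shlex.split(sample, posix=True)
print(stripped)
# ['repeat', 'the', 'phrase', 'he waited', 'so', 'many', 'times']
```

Because the quote marks survive with posix=False, a token that starts and ends with a double quote can be recognised as a quoted span.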

import shlex

lines = [
'Why did the author use three sentences in a row that start with the words, "it spun"?',
'Why did the queen most likely say  “I would have tea instead.”',
'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
'Why were "the lights of his town growing smaller below them"?',
'What is a fdsfdsf for the word "adjust"?',
'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]
for line in lines:
    print(
        " ".join(
            # posix=False keeps the quotes, so a quoted span is one token wrapped in '"'
            '"<quote>"' if word[0] == '"' and word[-1] == '"' else word
            for word in shlex.split(line, posix=False)
        )
    )

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
  • Note 1: shlex does not treat curly quotes as quotes (see the second line of the output), so if your data contains them, .replace() them with straight quotes before passing each line to shlex.
  • Note 2: this replaces every quoted occurrence. To replace only the first one and keep the rest, you could do this instead (it can probably be written better, but take it as a proof of concept):
for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
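
Note 1 could be handled along these lines. This is only a sketch: mask_quotes and CURLY_TO_STRAIGHT are hypothetical names introduced here, and it assumes the only typographic quotes in the data are “ and ”.

```python
import shlex

# Hypothetical mapping: turn typographic double quotes into the ASCII
# quote that shlex recognises.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})

def mask_quotes(line):
    """Replace every double-quoted span in `line` with "<quote>"."""
    normalized = line.translate(CURLY_TO_STRAIGHT)
    return " ".join(
        '"<quote>"' if word.startswith('"') and word.endswith('"') else word
        for word in shlex.split(normalized, posix=False)
    )

masked = mask_quotes(
    "Why did the queen most likely say  \u201cI would have tea instead.\u201d"
)
print(masked)
# Why did the queen most likely say "<quote>"
```

Keep in mind that shlex.split raises ValueError on an unbalanced quote, so lines with a stray quote mark would need a try/except or some cleanup first.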

Upvotes: 0

Ryszard Czech

Reputation: 18631

For these examples, use:

import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)

See Python proof. Use a separate command for each type of quote; this makes each one easier to control.

Results:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
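
If the quote marks should stay around the placeholder, as in the output the question asks for, the same idea might be wrapped up as below. This is a sketch under assumptions: mask_long_quotes and min_len are hypothetical names, and excluding newlines from a quoted span is a choice made here so that an unbalanced quote cannot swallow the following lines. Matching every quoted pair and deciding by length in the callback keeps the pairing of opening and closing quotes intact, so the text between two short quotes is never mistaken for a quotation.

```python
import re

def mask_long_quotes(text, min_len=20):
    """Mask quoted spans of at least `min_len` characters, keeping the quote marks."""
    def repl(match):
        open_q, body, close_q = match.group(1), match.group(2), match.group(3)
        # Short quotes (single words, short phrases) are returned unchanged.
        if len(body) >= min_len:
            return f"{open_q}<quote>{close_q}"
        return match.group(0)

    # One pass per quote style, as suggested above.
    text = re.sub(r'(")([^"\n]*)(")', repl, text)
    text = re.sub(r'(“)([^“”\n]*)(”)', repl, text)
    return text

txt = ('Reread these sentences: "This was his courtship, and it lasted all '
       'through the summer." What does the word "courtship" mean?')
print(mask_long_quotes(txt))
# Reread these sentences: "<quote>" What does the word "courtship" mean?
```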

Upvotes: 1
