Reputation: 53
I'm trying to find a way to delete all mentions of references in a text file.
I haven't tried much, as I am new to Python, but I thought this is something Python could do.
def remove_bracketed_words(text_from_file: string) -> string:
    """Remove all occurrences of words with brackets surrounding them,
    including the brackets.
    >>> remove_bracketed_words("nonsense (nonsense, 2015)")
    "nonsense "
    >>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
    "qwerty dkjah "
    """
    with open('random_text.txt') as file:
        wholefile = f.read()
        for '(' in
I have no idea where to go from here or if what I've done is right. Any suggestions would be helpful!
Upvotes: 2
Views: 403
Reputation: 119
Try the re module:
>>> import re
>>> re.sub(r'\(.*?\)', '', 'nonsense (nonsense, 2015)')
'nonsense '
>>> re.sub(r'\(.*?\)', '', 'qwerty (qwerty) dkjah (Smith, 2018)')
'qwerty  dkjah '
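The `?` after `.*` is what keeps each match confined to one pair of parentheses. As a small sketch of the difference, the snippet below runs the same substitution with the lazy quantifier and with a greedy one; the variable names are only illustrative.

```python
import re

sample = "qwerty (qwerty) dkjah (Smith, 2018)"

# Lazy: each match stops at the first closing parenthesis it can reach.
lazy = re.sub(r"\(.*?\)", "", sample)

# Greedy: a single match runs from the first "(" to the last ")".
greedy = re.sub(r"\(.*\)", "", sample)

print(repr(lazy))    # 'qwerty  dkjah '
print(repr(greedy))  # 'qwerty '
```

Note that the lazy version leaves two spaces between the remaining words, because the space on each side of the removed "(qwerty)" is kept.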
import re

def remove_bracketed_words(text_from_file: str) -> str:
    """Remove all occurrences of words with brackets surrounding them,
    including the brackets.

    >>> remove_bracketed_words("nonsense (nonsense, 2015)")
    'nonsense '
    >>> remove_bracketed_words("qwerty (qwerty) dkjah (Smith, 2018)")
    'qwerty  dkjah '
    """
    return re.sub(r'\(.*?\)', '', text_from_file)

with open('random_text.txt', 'r') as file:
    wholefile = file.read()

# Be careful with 'w': it truncates the file, so the original data is lost.
with open('random_text.txt', 'w') as file:
    file.write(remove_bracketed_words(wholefile))
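Since opening a file with 'w' discards its previous contents, one cautious variation is to write the cleaned text to a separate file and leave the original untouched. The following is a minimal sketch under that assumption. The names `strip_parenthesized` and `clean_file` and the output path are hypothetical, and the extra whitespace cleanup is an addition that the answer above does not perform.

```python
import re
from pathlib import Path


def strip_parenthesized(text: str) -> str:
    """Remove every parenthesized span, then tidy the leftover spacing."""
    without_parens = re.sub(r"\(.*?\)", "", text)
    # Collapse runs of spaces/tabs left behind where a span was removed.
    return re.sub(r"[ \t]{2,}", " ", without_parens)


def clean_file(source: str, destination: str) -> None:
    """Read `source`, strip parenthesized spans, write to `destination`."""
    text = Path(source).read_text(encoding="utf-8")
    Path(destination).write_text(strip_parenthesized(text), encoding="utf-8")
```

It could be called as `clean_file("random_text.txt", "random_text_clean.txt")`. Because `.` does not match a newline by default, a parenthesized span that wraps across two lines would be left in place.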
Upvotes: 1
Reputation: 49318
You'll have an easier time with a text editing program that handles regular expressions, like Notepad++, than learning Python for this one task (reading in a file, correcting fundamental errors like for '(' in..., etc.). You can even use tools available online for this, such as RegExr (a regular expression tester). In RegExr, write an appropriate expression into the "expression" field and paste your text into the "text" field. Then, in the "tools" area below the text, choose the "replace" option and remove the placeholder expression. Your cleaned-up text will appear there.
You're looking for a space, then a literal opening parenthesis, then some characters, then a comma, then a year (let's just call that 3 or 4 digits), then a literal closing parenthesis, so I'd suggest the following expression:
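 \(.*?, \d{3,4}\)
Note that the expression starts with a space, so the leading space before a citation is removed along with the citation itself. Parenthesized text that does not end in a comma and a year is not matched on its own. One caveat: because .*? can run past a closing parenthesis, a non-citation parenthesized phrase that comes before a citation on the same line is swallowed as well; writing [^()]* in place of .*? avoids that.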
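For anyone who does want to do this in Python, here is a minimal sketch of a citation-only substitution. It is a variation on the expression above, not the same one: `[^()]*` replaces `.*?` so that a match cannot start at an earlier, unrelated opening parenthesis. The name `remove_citations` is hypothetical, and the pattern assumes citations look like " (Author, 2018)" with no nested parentheses.

```python
import re

# Assumed citation shape: a leading space, "(", text without parentheses,
# a comma and space, a 3- or 4-digit year, then ")".
CITATION = re.compile(r" \([^()]*, \d{3,4}\)")


def remove_citations(text: str) -> str:
    """Delete citations but keep other parenthesized text."""
    return CITATION.sub("", text)


print(repr(remove_citations("qwerty (qwerty) dkjah (Smith, 2018)")))
# 'qwerty (qwerty) dkjah'
print(repr(remove_citations("nonsense (nonsense, 2015)")))
# 'nonsense'
```

Citations in other shapes, such as "(Smith et al., 2018, p. 4)" or "(Smith 2018)", would not match this pattern and would need their own alternatives.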
Upvotes: 1