Reputation: 87
How do you write a regular expression to match a specific word in a string, when the string has white space added in random places?
I've got a string that was extracted from a PDF document with a table structure. As a consequence of that structure, the extracted string contains randomly inserted newlines and spaces. The specific words and phrases that I'm looking for are there, with all characters in the correct order, but chopped up randomly by whitespace. For example: "sta ck over flow".
The content of the PDF document was extracted with PyPDF2, as this is the only option available in my company's Python library.
I know that I can write a specific string match for this with a possible white space after every character, but there must be a better way of searching for it.
Here's an example of what I've been trying to do.
import re

my_string = "find the ans weron sta ck over flow"
# Allow optional whitespace after every character of the target word:
# r's\s*t\s*a\s*c\s*k\s*' # etc
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)
Any suggestions?
Upvotes: 1
Views: 1202
Reputation: 12927
Actually, what you're doing is the best way. The only addition I can suggest is to construct the regexp dynamically from the word:
import re

word = "stack"
# Allow optional whitespace between consecutive characters of the word.
regexp = r'\s*'.join(word)
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)
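If the target may contain regex metacharacters (such as "." or "+") or is a multi-word phrase, the same idea can be wrapped in a small helper. This is only a sketch: the name spaced_pattern is made up for illustration, and it assumes that whitespace inside the target itself should be ignored as well.

```python
import re

def spaced_pattern(target):
    # Hypothetical helper, not part of any library.
    # Drop whitespace from the target so phrases like "stack overflow" work too.
    chars = "".join(target.split())
    # Escape each character so metacharacters are matched literally,
    # then allow optional whitespace between consecutive characters.
    return re.compile(r'\s*'.join(re.escape(ch) for ch in chars))

my_string = "find the ans weron sta ck over flow"

word_match = spaced_pattern("stack").search(my_string)
phrase_match = spaced_pattern("stack overflow").search(my_string)

print(repr(word_match.group(0)))    # 'sta ck'
print(repr(phrase_match.group(0)))  # 'sta ck over flow'
```

Using search() instead of sub() also gives you the position of the match in the original string via word_match.span(), which helps if you need to know where the word was found rather than just remove it.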
Upvotes: 2
Reputation: 521914
The best you can probably do here is to just strip all whitespace and then search for the target string inside the stripped text:
import re

my_string = "find the ans weron sta ck over flow"
# Remove all whitespace, then do a plain substring search.
my_string = re.sub(r'\s+', '', my_string)
if 'stack' in my_string:
    print("MATCH")
The reason I say "best" above is that, in general, you won't know whether a space is an actual word boundary or just random whitespace that was inserted during extraction. So you can really only do as well as finding your target as a substring of the stripped text. Note that the input text 'rust acknowledge' would now also match positive for stack.
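To make the trade-off concrete, here is a minimal sketch of the strip-and-search check wrapped in a function (the name contains_ignoring_whitespace is made up for illustration). It shows both the intended match and the false positive:

```python
import re

def contains_ignoring_whitespace(text, target):
    # Hypothetical helper: remove every whitespace character,
    # then test for the target as a plain substring.
    stripped = re.sub(r'\s+', '', text)
    return target in stripped

# The intended case: the word is split by random whitespace.
print(contains_ignoring_whitespace("find the ans weron sta ck over flow", "stack"))  # True

# The false positive: "rust acknowledge" becomes "rustacknowledge",
# which contains "stack" across the original word boundary.
print(contains_ignoring_whitespace("rust acknowledge", "stack"))  # True
```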
Upvotes: 2