Python Regex Find Word with Random White Space Mixed in

Question

How do you write a regular expression to match a specific word in a string, when the string has white space added in random places?

I've got a string that has been extracted from a PDF document that has a table structure. As a consequence of that structure, the extracted string contains randomly inserted newlines and white spaces. The specific words and phrases that I'm looking for are there, with the characters all in the correct order, but chopped up randomly by white space. For example: "sta ck over flow".

The content of the PDF document was extracted with PyPDF2, because it is the only option available in my company's Python library.

I know that I can write a specific string match for this with a possible white space after every character, but there must be a better way of searching for it.

Here's an example of what I've been trying to do.

import re

my_string = "find the ans weron sta ck over flow"
# Allow optional whitespace after every character of the word "stack"
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)

Any suggestions?

Błotosmętek · Accepted Answer

Actually, what you're doing is the best way. The only addition I can suggest is to construct the regexp dynamically from the word:

import re

word = "stack"
# Join the characters with \s* so whitespace may appear between any two of them
regexp = r'\s*'.join(word)
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)
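If the word might contain characters that are special in a regexp (a dot, a plus sign, parentheses), joining the raw characters would change the pattern's meaning. A minimal sketch of one way to guard against that, assuming a helper named spaced_pattern (a hypothetical name, not part of the re module) and assuming you want to find or repair the phrase rather than delete it:

```python
import re

def spaced_pattern(word):
    # Hypothetical helper: escape each character so that regex
    # metacharacters in the word are matched literally, then allow
    # any amount of whitespace between consecutive characters.
    return re.compile(r'\s*'.join(re.escape(ch) for ch in word))

my_string = "find the ans weron sta ck over flow"
pattern = spaced_pattern("stackoverflow")

# Locate the broken-up phrase in the extracted text
match = pattern.search(my_string)
found = match.group(0) if match else None

# Replace the broken-up phrase with its intact form
repaired = pattern.sub("stackoverflow", my_string)
```

Because the whitespace is only allowed between characters, the match starts and ends on a character of the word, so the spaces around the phrase are left alone.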
