Reputation: 259
I am cleaning some data for text analysis that I extracted from PDFs. I have noticed that one of the errors is strange spacing in words that end in "y". Specifically, the final y is broken off from the word by a space: theor y. I'm trying to use re.sub to identify these instances and then collapse the space.
I've been able to write what I think is a good regex string (see https://regex101.com/r/M1jpe6/5), but I'm not getting the results that I expect. I suspect that I'm missing something about the re.sub method.
Here is my toy code.
import re
string = 'this is my theor y of dance'
regex_y = r'\b\w*\b(\sy)\b'
new_string = re.sub(regex_y, 'y', string)
print(new_string)
What I expect to print from the above is
this is my theory of dance
but what it actually prints is
this is my y of dance
Since the only capturing group in my regex is (\sy), I expect to substitute " y" (space plus y) with "y". Instead, it's clear that I'm matching on the bigger string "theor y" and then replacing that whole thing with "y".
Why is this happening when I'm only capturing (\sy)? How do I write my re.sub string so it works as I intend?
Upvotes: 0
Views: 28
Reputation: 521103
Your example is a bit contrived, but if you wanted to remove whitespace before dangling y characters, I would use this:
import re
string = 'this is my theor y of dance'
# Match only the whitespace and the trailing y, so nothing else is replaced
string = re.sub(r'\b\s+y\b', 'y', string)
print(string)
this is my theory of dance
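The problem with using a capture group here is that re.sub replaces the entire match, not just the captured group. Your pattern matches "theor y" as a whole, so all of it is replaced with "y". With a capture group approach, you would need to capture the part you want to keep (the word before the space) and reference it in the replacement string.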
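As a minimal sketch of that point, the snippet below first inspects what the question's pattern actually matches, then shows one possible capture group variant that keeps the preceding word through a backreference. The variable names (text, m, fixed) and the rewritten pattern are illustrative assumptions, not something taken from the question.

```python
import re

text = 'this is my theor y of dance'

# The question's pattern: re.sub replaces group(0), the whole match,
# no matter which parts of it are wrapped in capturing groups.
m = re.search(r'\b\w*\b(\sy)\b', text)
print(repr(m.group(0)))  # 'theor y'  (this is what gets replaced)
print(repr(m.group(1)))  # ' y'       (the group alone is not the target)

# Hypothetical alternative: capture the word to keep and put it back
# with the \1 backreference, appending the y without the space.
fixed = re.sub(r'\b(\w+)\sy\b', r'\1y', text)
print(fixed)  # this is my theory of dance
```

Either approach will also join a legitimate standalone y to the word before it (for example "the y axis" would become "they axis"), so it is probably worth checking the results against your real data.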
Upvotes: 1