How to fix re.sub capturing in Python regex?

Question

I am cleaning some data for text analysis that I extracted from PDFs. I have noticed that one of the errors is strange spacing in words that end in "y." Specifically, the final y is broken off from the word by a space: theor y. I'm trying to use re.sub to identify these instances and then collapse the space.

I've been able to write what I think is a good regex string (see https://regex101.com/r/M1jpe6/5), but I'm not getting the results that I expect. I suspect that I'm missing something about the re.sub method.

Here is my toy code.

import re
string = 'this is my theor y of dance'
regex_y = r'\b\w*\b(\sy)\b'

new_string = re.sub(regex_y, 'y', string)
print(new_string)

What I expect to print from the above is

this is my theory of dance

but what it actually prints is

this is my y of dance

Since the only capturing group in my regex is (\sy), I expect to replace " y" (with its leading space) with "y". Instead, it's clear that I'm matching the bigger string theor y and then replacing that whole thing with y.

Why is this happening when I'm only capturing (\sy)? How do I write my re.sub string so it works as I intend?

Tim Biegeleisen · Accepted Answer

Your example is a bit contrived, but if you wanted to remove whitespace before dangling y characters, I would use this:

import re

string = 'this is my theor y of dance'
# Remove the whitespace between a word and a dangling "y"
string = re.sub(r'\b\s+y\b', 'y', string)
print(string)

this is my theory of dance
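
As for why your original pattern printed this is my y of dance, here is a quick check, just a sketch using your own string and pattern (the variable name match is mine). It compares the overall match with the group inside it:

```python
import re

string = 'this is my theor y of dance'
regex_y = r'\b\w*\b(\sy)\b'  # the pattern from the question

match = re.search(regex_y, string)

# group(0) is the full match, which is what re.sub replaces
print(repr(match.group(0)))
# group(1) is only the parenthesised part of that match
print(repr(match.group(1)))
```

The full match includes theor because \w* consumes it, and re.sub swaps out the full match regardless of what the group covers.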

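The problem with using capture groups here is that re.sub replaces the entire match, not just the captured group. Your pattern matches all of theor y, so all of it gets replaced with y. With a capture group approach, you would need to capture the part of the match you want to keep and put it back in the replacement.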
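
If you do want to go the capture group route, a minimal sketch would be to capture the word before the gap and reinsert it with a backreference. The patterns and the names fixed_group and fixed_lookbehind below are my own illustration, not something from your code. The second variant uses a lookbehind, so the preceding word is never part of the match in the first place:

```python
import re

string = 'this is my theor y of dance'

# Capture the word before the gap and put it back with \1
fixed_group = re.sub(r'(\w+)\s+y\b', r'\1y', string)
print(fixed_group)

# Alternative: assert that a word character precedes the gap without consuming it
fixed_lookbehind = re.sub(r'(?<=\w)\s+y\b', 'y', string)
print(fixed_lookbehind)
```

Either way, keep in mind that these patterns join any standalone y to the word before it, so a legitimate lone y (say, in "the y axis") would be glued on as well.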
