Satanas

Reputation: 171

Replace spaces with non-breaking spaces according to a specific criterion

I want to clean up files that contain bad formatting. More precisely, I want to replace "normal" spaces with non-breaking spaces according to a given criterion.

For example:

If in a sentence, I have:

"You need to walk 5 km."

I need to replace the space between 5 and km with a non-breaking space.

So far, I have managed to do this:

import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# iterate and read all files in the directory
for file in os.listdir():
    # check if the file is a file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # look for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')

But this replaces all the spaces in the document, not just the ones I want. Can you help me?


EDIT

After UlfR's answer, which was very useful and relevant, I would like to push my criteria further and make my search/replace more complex.

Now I would like to look at the characters before/after a word to decide whether to replace spaces with non-breaking spaces. For example:

I've tried to do this:

units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']

nbsp = '\u00A0'

rgx = re.sub(r'(\b\d+)(%s) (%s)\b'%(units, units_before_after),r'\1%s\2'%nbsp,text))

print(rgx)

But I'm having some trouble. Do you have any ideas to share?

Upvotes: 1

Views: 744

Answers (1)

UlfR

Reputation: 4395

You should use the re module to do the replacement, like so:

import re

text = "You need to walk 5 km or 500000 cm."
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
nbsp = '\u00A0'

print(re.sub(r'(\b\d+) (%s)\b'%'|'.join(units),r'\1%s\2'%nbsp,text))

Both the search and replace patterns are dynamically built, but basically you have a pattern that matches:

  1. A word boundary \b, so the number starts at the beginning of a word
  2. One or more digits \d+ (first group)
  3. One space
  4. One of the units km|m|cm|... (second group)
  5. A word boundary \b, so the unit ends at the end of a word

Then we replace the whole match with the two captured groups (the number and the unit), with the nbsp string between them.
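If you want to reuse this on the files from the question, the same pattern can be wrapped in a small function. This is only a sketch: the names build_pattern, add_nbsp and fix_file are made up for illustration, and fix_file assumes the files are UTF-8 text that you are happy to overwrite in place. re.escape is added so that a unit containing a regex metacharacter would still be matched literally.

```python
import re
from pathlib import Path

NBSP = "\u00A0"
UNITS = ["km", "m", "cm", "mm", "mi", "yd", "ft", "in"]


def build_pattern(units):
    # Escape each unit so any regex metacharacter in it is matched literally.
    alternatives = "|".join(re.escape(u) for u in units)
    # Group 1: the digits, group 2: the unit, with a single space between them.
    return re.compile(r"(\b\d+) (%s)\b" % alternatives)


def add_nbsp(text, units=UNITS):
    # Keep both groups and swap only the space between them.
    pattern = build_pattern(units)
    return pattern.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


def fix_file(path, units=UNITS):
    # Hypothetical helper: rewrite one UTF-8 text file in place,
    # touching the file only if something actually changed.
    p = Path(path)
    original = p.read_text(encoding="utf-8")
    updated = add_nbsp(original, units)
    if updated != original:
        p.write_text(updated, encoding="utf-8")


print(repr(add_nbsp("You need to walk 5 km or 500000 cm.")))
# 'You need to walk 5\xa0km or 500000\xa0cm.'
```

Because of the trailing \b, a word that merely starts with a unit (for example "5 minutes") is left alone.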

See the re documentation for more info on how to use regular expressions in Python. It's well worth the time invested to learn the basics, since it's a very powerful and useful tool!

Have fun :)

Upvotes: 1
