Timat

Reputation: 43

Get historical spelling corrected

Hello everyone, this is my first post here. I am writing a Python script that returns the standard form of words. It applies rules to normalize the spelling of a historical text. The code does not work properly: it only displays the modified words, not the entire file. Any ideas on how to solve this?

import re, string, unicodedata
from nltk.corpus import stopwords
import spacy
import codecs

nlp = spacy.load('fr')
with codecs.open(r'/home/m16/fatkab/RD_project/corpus.txt', encoding='utf8') as f:
    word = f.read()
    tokens = re.split(r'\W+', word)
    print(tokens)

for word in tokens:
    rule1 = word.replace('y', 'i')

    # to avoid modifying y as a word itself:
    if word.endswith('y') and len(word) >= 2:
        print(rule1)

my sample input: Or puis que Dieu est ainsi descendu à nous,qu'il luy a pleu de nous communiquer ainsi sa bonté : n'est ce pas raison que nous soyons du tout siens? Et d'autant qu'il nous a tendu la main pour nous racheter, ne faut-il pas que nous soyons son heritage, quand il nous a acquis par sa vertu? Le peuple donc s'il eust eu vn grain de prudence , deuoit bien se ranger en toute humilité pour receuoir la doctrine qui luy estoit preschee par Moyse. Et mesme quelle authorite meritoit la Loy , qui estoit ainsi approuuee par tant de miracles?Car Dieu ne commande pas simplement à Moyse de parler, apres l'auoir choisi pour son prophete:mais il le tire en la montagne, il le separe de la compagnie des hommes,afin que quand il viendra mettre en auant la Loy,qu'on le tienne comme vn Ange,& non point comme vne creature mortelle.

Here is the output:

lui
lui
lui
ai
oui
Loi
lui
foi
Loi
hui
soi
lui
lui
lui
ci
Loi
soi
lui
ai
lui
lui
doi
quoi
soi
ai
lui
lui
soi
# the language is French

Upvotes: 1

Views: 85

Answers (1)

user955340


Use re.sub on the entire text.

One major benefit of regex is that you can run a rule across large amounts of text - without having to manually tokenise and rebuild the output.

import re
text = "ouy you are the best luy guy in the try"
sub_pattern = re.compile(r"y(\W+|$)")
print(re.sub(sub_pattern, r"i\1", text))
# oui you are the best lui gui in the tri

Here we use the re.sub functionality to replace each match of the pattern with our replacement, across the entire file.

To keep the characters that follow each "y" (spaces, punctuation), we use the backreference \1 in the replacement pattern. This adds the text from capture group 1 in the match back into the output.
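As a rough sketch of how this could be applied to a whole file instead of a test string: the helper names normalize_final_y and normalize_file are made up for illustration, and the paths you pass in are up to you.

```python
import re

# Same pattern as above: a "y" followed by non-word characters or the end of the text.
FINAL_Y = re.compile(r"y(\W+|$)")


def normalize_final_y(text):
    """Replace word-final 'y' with 'i', keeping whatever followed it."""
    return FINAL_Y.sub(r"i\1", text)


def normalize_file(src_path, dst_path):
    """Read src_path, apply the rule to the whole text, write the result to dst_path."""
    with open(src_path, encoding="utf8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf8") as dst:
        dst.write(normalize_final_y(text))


sample = "la doctrine qui luy estoit preschee par Moyse. Et la Loy , qui estoit"
print(normalize_final_y(sample))
# la doctrine qui lui estoit preschee par Moyse. Et la Loi , qui estoit
```

Because the substitution runs on the full text, the unmodified words, spacing and punctuation all come through untouched, which is what the token loop in the question was losing.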

Regex patterns explained:

re.compile - if you're using the same regex over and over, compiling it once saves the machine having to keep re-computing it. In this case, it's just used to separate that regex onto its own line for clarity.

r"y(\W+|$)" - the r tells Python to treat the string as raw, so backslashes reach the regex engine unchanged instead of being read as string escapes. To match the "y"s at the end of words, the rule is "a 'y' followed by one or more non-word characters, or the end of the string ($)". This is the pattern we use to match all the "incorrect" 'y' endings in the input. Note that the non-word characters (spaces, punctuation) are captured in a group () so we can use them in the backreference later.

r"i\1" - First we want to replace the matched "y" plus the non-word characters after it with an "i", as per your rules. Then we need to put those trailing characters back in, which we do with the backreference \1. It adds whatever content was captured by group 1 in our pattern (\W+|$).
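To see what the capture group and \1 are doing, here is a small comparison on a made-up sentence (the sentence and the variable names dropped and kept are just for illustration):

```python
import re

sub_pattern = re.compile(r"y(\W+|$)")
text = "Moyse, luy & la Loy."

# Without the backreference, everything the group consumed is thrown away.
dropped = sub_pattern.sub("i", text)
print(dropped)
# Moyse, luila Loi

# With \1, the captured spaces and punctuation are written back after the "i".
kept = sub_pattern.sub(r"i\1", text)
print(kept)
# Moyse, lui & la Loi.
```

Calling .sub on the compiled pattern is equivalent to passing it to re.sub as in the snippet above.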


Alternatively

Instead of capturing the trailing characters, replacing them and adding them back in, we can also use a lookahead in the original pattern, so that only the "y" is matched and replaced.

For this you could use the pattern:

sub_pattern = re.compile(r"y(?=\W+|$)")
print(re.sub(sub_pattern, r"i", text))
# oui you are the best lui gui in the tri

Note that the part of the pattern that matches the trailing characters is now prefixed with ?=, which makes it a lookahead (a zero-width assertion). This means it will check that these characters exist after the "y", but it does not consume them, so they are not touched by the replacement. As such, the replacement only needs to be "i".
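One thing neither pattern handles is the requirement from the question that "y" as a word by itself should be left alone. A possible way to cover that, which is my own assumption about what you want rather than something tested on your corpus, is to add a lookbehind that requires a word character before the "y":

```python
import re

# Lookahead-only pattern from above: also rewrites the standalone word "y".
lookahead_only = re.compile(r"y(?=\W+|$)")

# Hypothetical variant: (?<=\w) requires a word character before the "y",
# so a "y" standing alone as a word no longer matches.
protected = re.compile(r"(?<=\w)y(?=\W|$)")

text = "il y a luy et la Loy"

print(lookahead_only.sub("i", text))
# il i a lui et la Loi

print(protected.sub("i", text))
# il y a lui et la Loi
```

Like the other rules, this still rewrites every word ending in "y", so words that should keep their "y" would need to be handled separately.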

Upvotes: 2
