English Prof WRaabe

Reputation: 47

Python: Count items, store count as variable, for statement with string replace to number items in external file

I have a set of files in which I've tagged the beginning of paragraphs and sentences, but I need to iterate over each file so that each paragraph and each sentence in a file has a unique numerical ID. I believe that this can be done with str.replace or with the regular expression module.

In the external files, paragraph and sentence opening tags are marked as follows:

<p id="####"> # 4 for paragraphs
<s id="#####"> # 5 for sentences

Here is the script that opens the external files and calls the paragraph and sentence numbering functions (kept in a separate module). It doesn't work.

import re, fileinput, NumberRoutines
ListFiles = ['j2vch34.txt', '79HOch16.txt']

with fileinput.input(files=(ListFiles), inplace=True, backup='.bak') as f:
    for filename in ListFiles:
        with open(filename) as file: 
            text = file.read() # read file into memory
        text = NumberRoutines.NumberParas(text)
        text = NumberRoutines.NumberSentences(text)

    with open(filename, 'w') as file: 
        file.write(text) 

In NumberRoutines, I've tried to apply the numbering. This is the version for paragraphs:

def NumberParas(text):
    sub = "p id="
    str = text
    totalparas = str.count(sub, 0, len(str))
    counter = 0

    for paranumber in range(totalparas):
        return str.replace('p id="####"', 'p id="{paranumber}"'.format(**locals()))
        counter += 1

Following R Nar's response below, I have repaired that from earlier, so that I no longer get an error. It re-writes the file, but the paranumber is always 0.

The second way that I've tried to apply numbering, this time with sentences:

def NumberSentences(text):
    sub = "s id="
    str = text
    totalsentences = str.count(sub, 0, len(str))
    counter = 0

    for sentencenumber in range(totalsentences):
        return str.replace('s id="#####"', 's id="{counter}"'.format(**locals()))
        counter += 1

Former type error (Can't convert 'int' object to str implicitly) resolved.

It's reading and rewriting the files, but all sentences are being numbered 0.

Two other questions: 1. Do I need the **locals for local scoping of variables inside the for statement? 2. Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex.

I have read https://docs.python.org/3.4/library/stdtypes.html#textseq and chapter 13 of Mark Summerfield's Programming in Python 3, and was influenced by Dan McDougall's answer on Putting a variable inside a string (python).

Several years ago I struggled with the same thing in Perl (2009 Query to PERL beginners), so sigh.

Upvotes: 3

Views: 1254

Answers (2)

R Nar

Reputation: 5515

I don't know why you have the fileinput line if you are already going to iterate through each file inside the with block, so I just took it out:

for filename in ListFiles:
    with open(filename) as file: 
        text = file.read()
    text = NumberRoutines.NumberParas(text)
    text = NumberRoutines.NumberSentences(text)
    with open(filename, 'w') as file: 
        file.write(text) # write the renumbered text back to the same file

This uses the same logic. However, in your code the writing block was outside the for loop, so it would only write to the last file in the file list.

Now for the functions:

def NumberParas(text):
    #all that starting stuff can be eliminated with the for loop below
    returnstring = ''
    for i, para in enumerate(text.split('p id="####"')): # minor edit to match spacing in sample.
        if i:
            returnstring = returnstring + 'p id="%d"%s' % (i-1,para) # no spaces around "=", to match the original tag
        else:
            returnstring = para
    return returnstring

And similarly:

def NumberSentences(text):
    returnstring = ''
    for i, sent in enumerate(text.split('s id="#####"')): # minor edit to match spacing.
        if i:
            returnstring = returnstring + 's id="%d"%s' % (i-1,sent) # minor edit for "sent" in this instance
        else:
            returnstring = sent
    return returnstring

The reason I changed the logic is that str.replace replaces all instances of whatever you want to replace, not just the first. That means that the first time you call it, every tag in the text is replaced and the rest of the for loop is useless. (In your current version the return also sits inside the loop, so the function exits on the first pass, which is why every tag comes out as 0.) Also, you need to actually return the string rather than just changing it in the function: strings are immutable, so any change inside the function produces a new string and leaves the caller's string untouched.
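
To make both points concrete, here is a minimal sketch. The sample string and the helper name number_tags are made up for illustration and are not part of the original code. It shows one str.replace call rewriting every placeholder at once, and then a loop that passes the optional count argument of 1 so that only the first remaining placeholder is replaced on each pass:

```python
sample = '<p id="####">one <p id="####">two <p id="####">three'

# Without a count, one call gives every placeholder the same value.
all_at_once = sample.replace('p id="####"', 'p id="0"')


def number_tags(text, placeholder='p id="####"', template='p id="%d"'):
    """Hypothetical helper: replace placeholders one at a time, numbering from 0."""
    total = text.count(placeholder)
    for n in range(total):
        # The third argument (count=1) limits the replacement to the
        # first placeholder that is still left in the text.
        text = text.replace(placeholder, template % n, 1)
    return text  # return only after the loop has finished


numbered = number_tags(sample)
print(all_at_once)
print(numbered)
```

This keeps the str.replace approach from the question. The split-based functions above do the same job in a single pass over the text.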

The internal if i: line is there because the first item in the enumerated list is whatever comes before the first tag. I assume that would be empty since the tags come before each sentence/paragraph, but you may have whitespace or the like.
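
A quick way to check what that first item looks like in your own files is to print the result of the split. This is only a sketch, and the sample text (including the "header" line) is invented:

```python
text = 'header\n<s id="#####">First. <s id="#####">Second.'

chunks = text.split('s id="#####"')

# chunks[0] is everything before the first tag, so it must be copied
# through unchanged; only the later chunks get a number in front of them.
print(chunks)
```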

BTW: this can all be accomplished with a one-liner, because Python:

>>> s = 'p tag asdfawegasdf p tag haerghasdngjh p tag aergaedrg'
>>> ''.join(['p tag%d%s' % (i-1, p) if i else p for i,p in enumerate(s.split('p tag'))])
'p tag0 asdfawegasdf p tag1 haerghasdngjh p tag2 aergaedrg'

Upvotes: 1

decltype_auto

Reputation: 1736

TypeError: must be str, not None

Your NumberParas(text) returns nothing

TypeError: Can't convert 'int' object to str implicitly

Convert int i to str with str(i)

  1. Do I need the **locals for local scoping of variables inside the for statement?

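Not for scoping. The locals() function call only builds your parameter dict for format() automatically; passing the value explicitly, as in .format(paranumber=paranumber), works just as well.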
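
As a small illustration (the function name label is hypothetical), these three calls build the same string, so **locals() is a convenience rather than a requirement:

```python
def label(paranumber):
    # locals() here is a dict containing the name "paranumber"
    via_locals = 'p id="{paranumber}"'.format(**locals())
    # the same field filled by an explicit keyword argument
    explicit = 'p id="{paranumber}"'.format(paranumber=paranumber)
    # or by position, with an anonymous {} field
    positional = 'p id="{}"'.format(paranumber)
    return via_locals, explicit, positional


print(label(7))
```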

  2. Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex

#!/usr/bin/env python3
import re

tok='####'
regex = re.compile(tok)

bar = 41
def foo(s):
    bar = 42
    return regex.sub("%(bar)i" % locals(), s)

s = 's id="####"'
print(foo(s))

output:

s id="42"
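
The example above substitutes one fixed value. To give every match its own number, re.sub also accepts a function as the replacement, which is called once per match. The following is a sketch under assumptions: the helper name number_ids and the sample string are made up, and the pattern assumes tags written exactly as in the question (`<p id="####">`, `<s id="#####">`).

```python
import itertools
import re


def number_ids(text, tag, start=0):
    """Hypothetical helper: give each <tag id="###..."> a running number."""
    counter = itertools.count(start)
    # Group 1 is the opening part of the tag, group 2 the closing quote;
    # the run of # characters between them is what gets replaced.
    pattern = re.compile(r'(<%s id=")#+(")' % re.escape(tag))

    def repl(match):
        # Called once per match, so each tag receives the next number.
        return '%s%d%s' % (match.group(1), next(counter), match.group(2))

    return pattern.sub(repl, text)


sample = ('<p id="####"><s id="#####">A. <s id="#####">B. '
          '<p id="####"><s id="#####">C.')
result = number_ids(number_ids(sample, 'p'), 's')
print(result)
```

Paragraphs and sentences are numbered in separate passes here, each with its own counter starting at 0. Pass a different start value if the IDs should begin at 1.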

Upvotes: 1
