Reputation: 47
I have a set of files in which I've tagged the beginning of paragraphs and sentences, but I need to iterate over each file so that each paragraph and each sentence in a file has a unique numerical ID. I believe that this can be done with str.replace or with the Regular Expression module.
In the external files, sentence opening tags are marked as follows:
<p id="####"> # 4 for paragraphs
<s id="#####"> # 5 for sentences
Here I open the external files and call the paragraph and sentence numbering functions (kept in a separate module), which doesn't work:
import re, fileinput, NumberRoutines

ListFiles = ['j2vch34.txt', '79HOch16.txt']
with fileinput.input(files=(ListFiles), inplace=True, backup='.bak') as f:
    for filename in ListFiles:
        with open(filename) as file:
            text = file.read()  # read file into memory
        text = NumberRoutines.NumberParas(text)
        text = NumberRoutines.NumberSentences(text)
        with open(filename, 'w') as file:
            file.write(text)
In NumberRoutines, I've tried to apply the numbering. Here is the paragraph version:
def NumberParas(text):
    sub = "p id="
    str = text
    totalparas = str.count(sub, 0, len(str))
    counter = 0
    for paranumber in range(totalparas):
        return str.replace('p id="####"', 'p id="{paranumber}"'.format(**locals()))
        counter += 1
Following R Nar's response below, I have repaired that from earlier, so that I no longer get an error. It re-writes the file, but the paranumber is always 0.
The second way that I've tried to apply numbering, this time with sentences:
def NumberSentences(text):
    sub = "s id="
    str = text
    totalsentences = str.count(sub, 0, len(str))
    counter = 0
    for sentencenumber in range(totalsentences):
        return str.replace('s id="#####"', 's id="{counter}"'.format(**locals()))
        counter += 1
Former type error (Can't convert 'int' object to str implicitly) resolved.
It's reading and rewriting the files, but all sentences are being numbered 0.
Two other questions: 1. Do I need the **locals for local scoping of variables inside the for statement? 2. Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex.
I have read https://docs.python.org/3.4/library/stdtypes.html#textseq and chapter 13 of Mark Summerfield's Programming in Python 3, and was influenced by Dan McDougall's answer on Putting a variable inside a string (python).
Several years ago I struggled with the same thing in Perl (2009 Query to PERL beginners), so sigh.
Upvotes: 3
Views: 1254
Reputation: 5515
I don't know why you have the fileinput line if you are already going to iterate through each file inside of the with block, so I just took it out:
for filename in ListFiles:
    with open(filename) as file:
        text = file.read()
    text = NumberRoutines.NumberParas(text)
    text = NumberRoutines.NumberSentences(text)
    with open(filename, 'w') as file:
        file.write(text)  # produces error on this line
This uses the same logic. However, in your code the writing block was outside of the for loop, so it would only write to the last file in the file list.
Now the functions:
def NumberParas(text):
    # all that starting stuff can be eliminated with the for loop below
    returnstring = ''
    for i, para in enumerate(text.split('p id="####"')):  # minor edit to match spacing in sample.
        if i:
            # re-insert the tag without spaces around "=", matching the original tag format
            returnstring = returnstring + 'p id="%d"%s' % (i-1, para)
        else:
            # the first chunk is whatever precedes the first tag
            returnstring = para
    return returnstring
and similarly:
def NumberSentences(text):
    returnstring = ''
    for i, sent in enumerate(text.split('s id="#####"')):  # minor edit to match spacing.
        if i:
            # re-insert the tag without spaces around "=", matching the original tag format
            returnstring = returnstring + 's id="%d"%s' % (i-1, sent)  # minor edit for "sent" in this instance
        else:
            # the first chunk is whatever precedes the first tag
            returnstring = sent
    return returnstring
The reason I changed the logic is that str.replace replaces all instances of whatever you want to replace, not just the first. That means that the first time you call it, all tags are replaced in the text and the rest of the for loop is useless. Also, you need to actually return the string rather than just changing it in the function: strings are immutable, so the string you have inside the function is NOT the real string you want to change.
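To make that behaviour concrete, here is a small sketch. The sample string and the helper name number_tags are made up for illustration and are not part of the original NumberRoutines module. It shows that str.replace without a count rewrites every placeholder at once, while passing a count of 1 touches only the first remaining placeholder, so a loop can hand out successive numbers.

```python
def number_tags(text, placeholder, template):
    """Replace each occurrence of placeholder with template filled by 0, 1, 2, ...

    Hypothetical helper for illustration only.
    """
    total = text.count(placeholder)
    for n in range(total):
        # The third argument (count=1) limits the replacement to the
        # first remaining placeholder, so later ones stay untouched.
        text = text.replace(placeholder, template.format(n), 1)
    return text


sample = '<p id="####">one <p id="####">two <p id="####">three'

# Without a count, every placeholder receives the same number.
all_at_once = sample.replace('p id="####"', 'p id="0"')

# With count=1 inside a loop, each placeholder receives its own number.
numbered = number_tags(sample, 'p id="####"', 'p id="{}"')

print(all_at_once)
print(numbered)
```

This rescans the text from the start on every pass, which is fine for chapter-sized files; the split-and-enumerate approach above does the same job in a single pass.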
The internal if i: line is there because the first item in the enumerated list is whatever comes before the first tag. I assume that would be empty, since the tags come before each sentence/paragraph, but you may have whitespace or such.
BTW: this can all be accomplished with a one liner because python:
>>> s = 'p tag asdfawegasdf p tag haerghasdngjh p tag aergaedrg'
>>> ''.join(['p tag%d%s' % (i-1, p) if i else p for i,p in enumerate(s.split('p tag'))])
'p tag0 asdfawegasdf p tag1 haerghasdngjh p tag2 aergaedrg'
Upvotes: 1
Reputation: 1736
TypeError: must be str, not None
Your NumberParas(text) returns nothing (None), so file.write(text) is given None instead of a string.
TypeError: Can't convert 'int' object to str implicitly
Convert the int i to str with str(i).
- Do I need the **locals for local scoping of variables inside the for statement?
You need the locals() function call to build your parameter dict automatically.
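As a side note, locals() is a convenience rather than a requirement: str.format only needs the names that appear in the template, and those can be passed explicitly. A small sketch, with function and variable names made up for illustration:

```python
def with_locals():
    paranumber = 7
    # **locals() unpacks every local variable as a keyword argument.
    return 'p id="{paranumber}"'.format(**locals())


def with_keyword():
    paranumber = 7
    # Passing only the name the template uses is equivalent and more explicit.
    return 'p id="{paranumber}"'.format(paranumber=paranumber)


def with_positional():
    paranumber = 7
    # A positional field avoids naming the variable in the template at all.
    return 'p id="{}"'.format(paranumber)


print(with_locals(), with_keyword(), with_positional())
```

All three calls build the same string; the difference is only in how the value reaches the template.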
- Can this be done with RegEx? Despite many tries, I could not get the {} for replacing with variable value to work with regex
#!/usr/bin/env python3
import re

tok = '####'
regex = re.compile(tok)
bar = 41

def foo(s):
    bar = 42
    return regex.sub("%(bar)i" % locals(), s)

s = 's id="####"'
print(foo(s))
output:
s id="42"
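The example above substitutes one fixed value. For sequential numbering, re.sub also accepts a function as the replacement, and calls it once per match. Here is a minimal sketch, assuming the tags look exactly like the samples in the question. The helper name number_ids, the sample text, and the zero-based numbering are my own choices, not taken from the original code.

```python
import itertools
import re


def number_ids(text, tag, width):
    """Number every <tag id="###..."> placeholder, starting at 0.

    Hypothetical helper; assumes the placeholder is exactly `width`
    '#' characters, as in the samples from the question.
    """
    counter = itertools.count()
    pattern = re.compile(r'(<%s id=")#{%d}(")' % (re.escape(tag), width))

    def repl(match):
        # Called once per match, so each tag receives the next number.
        return '%s%d%s' % (match.group(1), next(counter), match.group(2))

    return pattern.sub(repl, text)


sample = '<p id="####"><s id="#####">A. <s id="#####">B. <p id="####"><s id="#####">C.'

# Number paragraphs first, then sentences; the two counters are independent.
result = number_ids(number_ids(sample, 'p', 4), 's', 5)
print(result)
```

Because the replacement is computed per match, no format placeholders are needed inside the pattern itself, which sidesteps the clash between regex braces and format braces.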
Upvotes: 1