Reputation: 233
I am having an issue. What I am trying to do is sort through data and create new lines at certain points. Currently, my code looks like this:
from __future__ import print_function
import re
NDoc = raw_input("Enter name of new document ")+".txt"
log = open(NDoc, 'w')
file = raw_input("Enter a file to be sorted ")
extfile = file+".txt"
xfile = open(file+".txt")
for line in xfile:
    l=line.strip()
    l=re.sub("\n","",l)
    n=re.sub("(\B)(?=((MTH|HST|ENG)[|]))","\n",line)
    if len(n) > 0:
        nl=n.split("\n")
        for item in nl:
            log.write(item+"\n")
            #print(item)
print ("The data from",extfile,"has been sorted into",NDoc)
Everything is working properly except that after the third term (ENG|) a new line is appearing in my data. For instance, if my datafile was like this:
MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
MTH|lettersandnumbersHST|
I would expect it to look like this:
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|
But it is instead giving me this:
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers

MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers

MTH|lettersandnumbers
HST|
Now, I thought that l=re.sub("\n","",l) would replace every \n with nothing before my new \n's were added, so why is an extra line still being produced, and only after ENG?
Thank you in advance for any insights.
Upvotes: 3
Views: 15940
Reputation: 31339
I think you're not using the proper tool.
You probably want re.sub:
print(re.sub("([^\n])(MTH|HST|ENG)", r"\1\n\2", st))
Short explanation: this captures any of the options MTH, HST or ENG that has no \n before it ([^\n] is "anything but \n"), together with the character before it, and adds a \n between them. The result is what you expect.
Example:
>>> import re
>>> st = """MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
... MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
... MTH|lettersandnumbersHST|"""
>>> print(re.sub("([^\n])(MTH|HST|ENG)", r"\1\n\2", st))
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|
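One caveat with the substitution above, as far as I can tell: it also breaks before MTH, HST or ENG when those letters occur inside the payload, and because it consumes the preceding character, back-to-back tags such as MTH|HST|ENG| are not all split. A possible variant, assuming every tag is immediately followed by |, uses lookarounds so that nothing is consumed. The names TAG_BOUNDARY and break_before_tags and the sample string are made up for illustration:

```python
import re

# Zero-width match: a non-newline character behind, a tag plus "|" ahead.
TAG_BOUNDARY = re.compile(r"(?<=[^\n])(?=(?:MTH|HST|ENG)\|)")

def break_before_tags(text):
    """Insert a newline before each tag that is not already at a line start."""
    return TAG_BOUNDARY.sub("\n", text)

sample = "MTH|abcHST|defENG|ghi\nMTH|HST|ENG|x"
print(break_before_tags(sample))
```

This is only a sketch; if your payloads never contain the tag letters and never are empty, the simpler pattern above behaves the same.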
Upvotes: 1
Reputation: 180441
You could use findall to match either pattern:
import re

s = """MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
MTH|lettersandnumbersHST|lettersandnumbersENG|lettersandnumbers
MTH|lettersandnumbersHST|"""
# Match "TAG|payload", or a bare "TAG|" when the payload is missing.
r = re.compile(r"([A-Z]+\|[0-9a-z]+|[A-Z]+\|)")
for line in s.splitlines(True):
    print("\n".join(r.findall(line)))
Output:
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|lettersandnumbers
ENG|lettersandnumbers
MTH|lettersandnumbers
HST|
Upvotes: 0
Reputation: 1199
You have spaces in your source data after "ENG." Just strip those out and you'll be fine.
l=re.sub(' ', '', l)
Upvotes: 0
Reputation: 49310
You're using the wrong name for your line.
l=line.strip()
l=re.sub("\n","",l)
should be
line=line.strip()
line=re.sub("\n","",line)
or simply
line=line.strip().replace('\n', '')
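To see why the blank line shows up only after ENG, here is a minimal sketch. The helper name split_records and the shortened sample line are hypothetical; the pattern is the one from the question. The substitution in the question runs on line, which still ends in "\n", so the later split("\n") leaves an empty string as the last item, and writing that item plus "\n" produces the blank line. ENG is simply the last field on each full line.

```python
import re

# Pattern from the question, written as a raw string.
PATTERN = r"(\B)(?=((MTH|HST|ENG)[|]))"

def split_records(line, strip_first):
    """Insert a newline before each tag, then split into records."""
    if strip_first:
        line = line.strip()  # drop the trailing "\n" before substituting
    return re.sub(PATTERN, "\n", line).split("\n")

raw = "MTH|aHST|bENG|c\n"  # one line as read from the file
print(split_records(raw, strip_first=False))  # ['MTH|a', 'HST|b', 'ENG|c', '']
print(split_records(raw, strip_first=True))   # ['MTH|a', 'HST|b', 'ENG|c']
```

The empty string at the end of the first result is what ends up as the extra line in the output file.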
Upvotes: 3