Reputation: 71
I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program detect such a pattern and change the citekey to the form xxxxx:2009?
Upvotes: 0
Views: 483
Reputation: 27575
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression works (at least on this example code):
import re

s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
# Keep the colon and the four-digit year (group 1); drop the two characters after it.
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon :, followed by four digits [0-9]{4}, followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' means you are replacing each whole match with the smaller part of it captured by the first (and only) group of parentheses. The r prefix on r'\1' makes it a raw string, so the backslash reaches the regex engine as a backreference to group 1 instead of being read by Python as an escape sequence.
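One caveat: \w also matches digits and underscores, and the pattern above does not check what comes after the two characters, so a key like key:2009abc would be clipped to key:2009c. If that matters for your file, a slightly stricter variant could look like the sketch below. It assumes the suffix is always exactly two lowercase letters and that the key ends right after it; the name strip_citekey_suffix is just made up for this example.

```python
import re

# Assumption: the suffix is exactly two lowercase letters and the key ends
# right after it (the \b word boundary enforces the second part).
CITEKEY_SUFFIX = re.compile(r'(:[0-9]{4})[a-z]{2}\b')

def strip_citekey_suffix(text):
    """Drop the two-letter suffix after the year in keys like name:2009tb."""
    return CITEKEY_SUFFIX.sub(r'\1', text)

print(strip_citekey_suffix("(newton:2009cb; darwin:1873dc)"))
print(strip_citekey_suffix("key:2009abc"))  # three letters: left untouched
```

Keep in mind that removing the suffix can make two different keys identical (newton:2009cb and newton:2009tb both become newton:2009), so it is worth checking the result for duplicates.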
Hope this helps!
Upvotes: 0
Reputation: 6851
It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly four digits. (Edit, to incorporate suggestions)
import re

inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # Keep the key and year (group 1) and drop the trailing "tb"; match your regex here
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text)
    o.write(text)
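If the keys are not literally xxxxx and the suffix is not always tb, the same read, substitute, write approach can be generalised. The following is only a sketch under the assumption that every citekey looks like a word, a colon, a four-digit year, and two lowercase letters; rewrite_citekeys and the pattern are names chosen for this example, not anything your bibliography tool provides.

```python
import re

# Assumption: a citekey is <word>:<4-digit year><2 lowercase letters>,
# and nothing word-like follows the two letters.
PATTERN = re.compile(r'(\w+:[0-9]{4})[a-z]{2}\b')

def rewrite_citekeys(in_path, out_path):
    """Copy in_path to out_path with the two-letter suffix removed.

    Returns the number of keys that were changed.
    """
    with open(in_path) as src:
        text = src.read()
    # subn returns the new text together with the number of replacements made.
    new_text, count = PATTERN.subn(r'\1', text)
    with open(out_path, 'w') as dst:
        dst.write(new_text)
    return count
```

Calling rewrite_citekeys('temp.txt', 'out.txt') would then write the cleaned file and tell you how many keys were touched, which is a quick sanity check against the number of references you expect.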
Upvotes: 1