Reputation: 380
I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and end with "~":
txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I eventually want is just the digits between the last "*" and the "~". Thanks.
Upvotes: 3
Views: 86
Reputation: 626853
You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1])  # Just grab the slice after the 7-character prefix, minus the trailing "~"
NOTES
"*" must be escaped to match a literal asterisk.
re.findall(r'^REF*23*.+~', line) makes little sense, as the re.findall method is used to get multiple matches while you expect one.
Your regex is not anchored on the right: you need $ or \Z to match ~ at the end of the line.
So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line)
if m:
    results.append(m.group(1))  # To grab just the contents between delimiters
    # or, to get the whole line instead:
    # results.append(line)
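To see why the original pattern returns nothing, here is a minimal sketch that runs both the unescaped and the escaped pattern against one sample line from the question. The variable names are illustrative only.

```python
import re

line = "REF*23*526344060~"  # sample line taken from the question

# Unescaped: "F*" means "zero or more F" and "3*" means "zero or more 3",
# so the pattern never looks for a literal asterisk and cannot match this line.
unescaped = re.findall(r'^REF*23*.+~', line)

# Escaped and anchored on both sides; the group captures the value.
m = re.search(r'^REF\*23\*(.*)~$', line)
value = m.group(1) if m else None

print(unescaped)  # []
print(value)      # 526344060
```

The unescaped pattern would instead match something like "RE2x~", which is not what the question is after.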
Edit as an answer to the comment
"Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between."
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if res:
            results.extend(res)
print(results)
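As a quick check of the behaviour on a single unbroken line, the sketch below applies the same pattern to an in-memory string. The string is made up for illustration: it joins segments from the question's sample and reuses one of its numbers under a second REF*0F* segment so that more than one match shows up.

```python
import re

# Hypothetical single-line content (not real data)
data = "REF*0F*452574437~REF*1L*627783972~REF*23*526344060~REF*0F*1024817112~"

# With one capturing group, findall returns just the captured digits
# for every non-overlapping match in the string.
values = re.findall(r'REF\*0F\*(\d+)~', data)
print(values)  # ['452574437', '1024817112']
```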
Upvotes: 4
Reputation: 5513
"*" is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape characters for Python's string parsing, but you still have to escape them for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to Leaning Toothpick Syndrome. Using a raw string means you only have to escape once, "saving some trees" in this regard.
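The equivalence of the two spellings can be checked directly. The sketch below also uses re.escape from the standard library, which is not part of the answer above, as one possible way to avoid escaping the prefix by hand; the names prefix and built are illustrative.

```python
import re

raw_pattern = r'^REF\*23\*.+~'
escaped_pattern = '^REF\\*23\\*.+~'

# Both literals produce the same string, so the regex engine sees the same pattern.
same = raw_pattern == escaped_pattern
print(same)  # True

# Alternative (an assumption, not the answer's method): let re.escape add the backslashes.
prefix = 'REF*23*'
built = '^' + re.escape(prefix) + '(.+)~'
m = re.match(built, 'REF*23*526344060~')
captured = m.group(1) if m else None
print(captured)  # 526344060
```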
You might also want to put parentheses around .+ to capture it as a group, if you want to extract it. Also change findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:  # iterate over the file; without this loop, "line" is undefined
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            results.append(int(p.group(1)))
Consider using a regex tester such as this one.
Upvotes: 1
Reputation: 43169
You can use string functions alone to get only the digits (though a simple regex might really be easier to understand):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)
This yields
['526344060']
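The same string-only idea can be written as a plain loop, which some readers may find easier to follow than the nested comprehension. This is a sketch only: extract_values and its parameters are hypothetical names, and it slices off the prefix rather than splitting on "*", which gives the same result for lines shaped like the sample.

```python
def extract_values(text, prefix="REF*23*", terminator="~"):
    """Return the text between prefix and terminator for each matching line."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix) and line.endswith(terminator):
            # Drop the prefix from the front and the terminator from the end
            values.append(line[len(prefix):-len(terminator)])
    return values


sample = "REF*1L*627783972~\nREF*23*526344060~\nDTP*473*D8*20191001~"
print(extract_values(sample))  # ['526344060']
```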
Upvotes: 1