Ben Smith

Reputation: 380

No luck finding regex pattern python

I am having no luck getting anything from this regex search.
I have a text file that looks like this:

REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~

I want to extract the lines that begin with "REF*23*" and end with "~".

txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)

But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I eventually want is just the digits between the last "*" and the "~". Thanks.

Upvotes: 3

Views: 86

Answers (3)

Wiktor Stribiżew

Reputation: 626853

You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)

print(results)

If you need to get the digit chunks:

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1]) # Just grab the slice

See non-regex approach demo.

NOTES

  • In a regex, * must be escaped to match a literal asterisk
  • You read line by line, so re.findall(r'^REF*23*.+~', line) makes little sense: re.findall is meant for getting multiple matches, while you expect one per line
  • Your regex is not anchored on the right; you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like

    m = re.search(r'^REF\*23\*(.*)~$', line)
    if m:
        results.append(m.group(1))  # To grab just the contents between delimiters
        # or results.append(line)   # To get the whole line

    See this Python demo

  • In your case, you search for lines that start and end with fixed text, so there is no need to use a regex.
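The notes above can be checked with a small sketch that uses one sample line from the question (the variable names are chosen here only for illustration). It shows why the unescaped pattern finds nothing and what the escaped, anchored pattern captures:

```python
import re

line = "REF*23*526344060~"

# Unescaped: "F*" means zero or more "F" and "3*" means zero or more "3",
# so the pattern expects "RE", optional "F"s, then "2" straight away.
unescaped = re.findall(r'^REF*23*.+~', line)

# Escaped and anchored on both sides: "\*" matches a literal asterisk.
m = re.search(r'^REF\*23\*(.*)~$', line)
digits = m.group(1) if m else None

print(unescaped)  # []
print(digits)     # 526344060
```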

Edit as an answer to the comment

Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.

You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:

import re

results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        # No anchors: the segment may sit anywhere inside the line
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if res:
            results.extend(res)

print(results)
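To try this without a file, here is a minimal sketch on an in-memory string. The sample data is hypothetical: it is built from the segments in the question, joined into one unbroken line, with the last segment relabelled as REF*0F* so that there are two matches.

```python
import re

# Hypothetical sample: segments joined into one unbroken line.
data = "REF*0F*452574437~REF*1L*627783972~REF*23*526344060~REF*0F*1024817112~"

# No ^ or $ anchors, since the segments sit in the middle of the line.
results = re.findall(r'REF\*0F\*(\d+)~', data)

print(results)  # ['452574437', '1024817112']
```

Because the pattern has a single capturing group, re.findall returns only the captured digits rather than the whole matched segment.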

Upvotes: 4

charmoniumQ

Reputation: 5513

* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape characters for Python's own string parsing, but you still have to escape them for the regex engine.

r'^REF\*23\*.+~'

or

'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine

will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw-string means you have to escape once, "saving some trees" in this regard.
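As a quick check of the two spellings, the sketch below compares them directly. It also uses re.escape as an alternative way to build the literal prefix; that is a suggestion added here, not something the question used, and the variable names are illustrative.

```python
import re

raw_pattern = r'^REF\*23\*.+~'
escaped_pattern = '^REF\\*23\\*.+~'

# Both literals produce the same pattern string.
same = raw_pattern == escaped_pattern

# re.escape backslash-escapes regex metacharacters in a literal string.
prefix = re.escape('REF*23*')
m = re.match('^' + prefix + r'(.+)~', 'REF*23*526344060~')
value = m.group(1) if m else None

print(same)   # True
print(value)  # 526344060
```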

Additional changes

You might also want to put parentheses around .+ to capture it as a group, if you want to extract it. Also change findall to match, unless you expect multiple matches per line.

import re

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        # re.match only matches at the start of the string
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            results.append(int(p.group(1)))

Consider using a regex tester such as this one.

Upvotes: 1

Jan

Reputation: 43169

You can use string functions alone to get only the digits (though a simple regex might be easier to understand, really):

raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""

result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)

This yields

['526344060']
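For comparison, the regex alternative mentioned above might look like the following sketch. It assumes the whole text is held in one string (a shortened copy of the sample is used here) and relies on re.MULTILINE so that ^ and $ match at each line boundary:

```python
import re

raw = """
REF*0F*452574437~
REF*23*526344060~
DTP*336*D8*20140623~
"""

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
result = re.findall(r'^REF\*23\*(\d+)~$', raw, flags=re.MULTILINE)

print(result)  # ['526344060']
```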

Upvotes: 1
