Keo Rithy

Reputation: 13

Python. Regular expression not returning output

I am trying to use findall to get every instance of the string "PB" and the digits that follow it, but when I call

number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))

the ([0-9])\d+ part doesn't return any output. I checked my output file, sequence.txt, but there is nothing in it. If I just use \bPB\b, it outputs "PB" but no numbers.

My input file, raw-sequence.txt, looks like this:

WB (19, 21, 24, 46, 60)
WB (12, 11, 9, 23, 49)
PB (18, 21, 10, 5, 5)
WB (2, 14, 2, 29, 67)
WB (1, 8, 1, 16, 52)
PB (2, 11, 8, 3, 4)

How can I output the following lines to sequence.txt?

PB (18, 21, 10, 5, 5)
PB (2, 11, 8, 3, 4)

Here is my current code:

sequence_raw_buffer = open('c:\\sequence.txt', 'a')
with open('c:\\raw-sequence.txt') as f:
  number_list = f.read().splitlines()
  number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
  unique = list(set(number_all))
  for i in unique:
    sequence_raw_buffer.write(i + '\n')
  print "done"
  f.close()
  sequence_raw_buffer.close()

Upvotes: 0

Views: 77

Answers (3)

Mad Physicist

Reputation: 114320

Given the code you show, a regex is an unnecessary complication for your problem. You can just iterate over the lines of the input file and write out the ones for which line.startswith("PB") returns True.

with open(r'c:\raw-sequence.txt', 'r') as f, open(r'c:\sequence.txt', 'a') as sequence_raw_buffer:
    for line in f:
        if line.startswith("PB"):
            sequence_raw_buffer.write(line)

This illustrates the fact that files can be iterated over line by line. Each line keeps its trailing newline when you iterate this way, so I write it out unchanged with write(); print would append a second newline and leave blank lines in the output.

This example also shows how to put multiple context managers into a single with block. You should open all of your files in a with block, whether for input or output, because I/O errors are possible in both directions.

Now, if you are trying to use regex for practice or because the match is really more complicated than what you present here, you can try

PB\s*\((?:\d+,\s*)*\d+\)

This matches as follows:

  • Literal PB
  • Any amount of whitespace, including none: \s*
  • Literal opening parenthesis \(
  • Non-capturing group (?:...)*, repeated zero or more times, containing
    • At least one digit \d+
    • Literal comma ,
    • Any amount of whitespace \s*
  • A final number of at least one digit \d+
  • Literal closing parenthesis \)
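
As a quick check of that breakdown, the pattern can be tried against a few lines like the ones in the question. This is only a sketch assuming Python 3; the samples list and the matches name are made up for illustration, and the "PB ()" entry is an invented edge case, not something from the input file.

```python
import re

# Pattern from the answer, as a raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')

samples = [
    "WB (19, 21, 24, 46, 60)",
    "PB (18, 21, 10, 5, 5)",
    "PB (2, 11, 8, 3, 4)",
    "PB ()",  # no numbers at all: the final \d+ needs at least one digit
]

# match() only tries at the start of the string, so the "WB ..." line is rejected.
matches = [s for s in samples if pattern.match(s)]
print(matches)
```

Only the two well-formed PB lines should survive the filter.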

I would not bother concatenating the whole file together and using findall on that though, unless your expression can span multiple lines. I would prefer to still use the approach shown above, because in all but a few cases that I can think of, textual data will generally be delimited by newlines:

pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')
...
            if pattern.match(line):
...

Pre-compiling the pattern once saves a little work on every iteration, but you could call re.match(..., line) each time as well; the re module caches recently used patterns, so the difference is usually small.
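
Put together, the line-by-line regex version might look roughly like the sketch below. It assumes Python 3 and uses io.StringIO objects as stand-ins for raw-sequence.txt and sequence.txt so that it runs without touching the disk; filter_pb_lines is a hypothetical helper name, not something from the question.

```python
import io
import re

# Compiled once, outside the loop.
pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')

def filter_pb_lines(source, sink):
    """Copy every line of source that starts with a PB record to sink."""
    for line in source:
        if pattern.match(line):
            sink.write(line)  # the line still ends with its own newline

# In-memory stand-ins for the input and output files.
raw = io.StringIO(
    "WB (19, 21, 24, 46, 60)\n"
    "PB (18, 21, 10, 5, 5)\n"
    "WB (2, 14, 2, 29, 67)\n"
    "PB (2, 11, 8, 3, 4)\n"
)
out = io.StringIO()
filter_pb_lines(raw, out)
print(out.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by the file objects opened in the with block shown earlier.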

Upvotes: 2

rock321987

Reputation: 11032

There are a few things you are missing:

  1. Your pattern does not account for the space between PB and the opening parenthesis (. The word boundary \b matches a position, not a character, so it does not consume the space.
  2. Parentheses () have a special meaning in regex: they denote a capturing group. To match a parenthesis literally, you need to escape it as \( or \).
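
Both points can be seen by running the patterns against a single sample line. The sketch below assumes Python 3, and the variable names are illustrative only. It shows that the original pattern finds nothing because it demands a digit immediately after PB, and that findall returns only the group's text when the pattern contains a capturing group.

```python
import re

line = "PB (18, 21, 10, 5, 5)"

# Original pattern: expects a digit directly after "PB", but the text has " (" there.
original = re.findall(r'\bPB\b([0-9])\d+', line)

# With one capturing group, findall returns the group's text, not the whole match.
captured = re.findall(r'\bPB\s+\((\d+)', line)

# Escaped parentheses plus a non-capturing group give back the full match.
full = re.findall(r'\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)', line)

print(original)  # []
print(captured)  # ['18']
print(full)      # ['PB (18, 21, 10, 5, 5)']
```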

Now to match the exact pattern you intend, you can use this

\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)

If you only want to match lines containing PB, you can search for PB directly.

Upvotes: 0

Alessandro Martini

Reputation: 71

You can try this regex: PB\s?\(([0-9]*,?\s?)*\)

Upvotes: 0
