Reputation: 13
I am trying to use findall to find instances of the string "PB" and the digits that follow it, but when I call
number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
the ([0-9])\d+ part doesn't return any output. I checked my output file, sequence.txt, but there is nothing inside it. If I use just \bPB\b, it outputs "PB" but no numbers.
My input file, raw-sequence.txt
looks like this:
WB (19, 21, 24, 46, 60)
WB (12, 11, 9, 23, 49)
PB (18, 21, 10, 5, 5)
WB (2, 14, 2, 29, 67)
WB (1, 8, 1, 16, 52)
PB (2, 11, 8, 3, 4)
How can I output the following lines to sequence.txt?
PB (18, 21, 10, 5, 5)
PB (2, 11, 8, 3, 4)
Here is my current code:
import re

sequence_raw_buffer = open('c:\\sequence.txt', 'a')
with open('c:\\raw-sequence.txt') as f:
    number_list = f.read().splitlines()
    number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
    unique = list(set(number_all))
    for i in unique:
        sequence_raw_buffer.write(i + '\n')
    print "done"
f.close()
sequence_raw_buffer.close()
Upvotes: 0
Views: 77
Reputation: 114320
Given the code you show, a regex is an unnecessary over-complication of your problem. You can just iterate over the lines of the input file and dump the ones for which line.startswith("PB") returns True.
with open(r'c:\raw-sequence.txt', 'r') as f, open(r'c:\sequence.txt', 'a') as sequence_raw_buffer:
    for line in f:
        if line.startswith("PB"):
            # Each line already ends with its own newline, so suppress print's.
            print(line, end='', file=sequence_raw_buffer)
This illustrates the fact that files can be iterated over line by line. I use print with end='' to dump the line because the for loop does not strip the line terminator: each line already carries its own, and a plain print would add a second one.
This example also shows you how to put multiple context managers into a single with block. You should open all your files in a with block, whether input or output, because I/O errors are a possibility in both directions.
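As a minimal sketch of the same filtering logic in a form that is easy to try out, the function below accepts any iterable of lines, so it works on an open file or on in-memory data. The name keep_pb_lines and the use of io.StringIO to stand in for the real files are assumptions made for illustration only.

```python
import io

def keep_pb_lines(lines):
    """Yield only the lines that start with "PB", unchanged."""
    for line in lines:
        if line.startswith("PB"):
            yield line

# io.StringIO stands in for the real input file (assumption for this demo).
sample = io.StringIO(
    "WB (19, 21, 24, 46, 60)\n"
    "PB (18, 21, 10, 5, 5)\n"
    "WB (2, 14, 2, 29, 67)\n"
    "PB (2, 11, 8, 3, 4)\n"
)

out = io.StringIO()
# Lines read from a file-like object keep their trailing newline,
# so writelines() can emit them exactly as they are.
out.writelines(keep_pb_lines(sample))
print(out.getvalue(), end="")
```

With real files, the two StringIO objects would simply be replaced by the handles opened in the with block.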
Now, if you are trying to use regex for practice or because the match is really more complicated than what you present here, you can try
PB\s*\((?:\d+,\s*)*\d+\)
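This matches as follows:
PB - the literal characters PB
\s* - optional whitespace
\( - a literal opening parenthesis
(?:...)* - a non-capturing group, repeated as many times as necessary, containing
    \d+ - one or more digits
    , - a literal comma
    \s* - optional whitespace
\d+ - one or more digits (the last number)
\) - a literal closing parenthesis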
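To see the pieces working together, here is a small sketch that writes the same pattern with re.VERBOSE so each part can carry a comment. The name PB_ROW and the sample string are made up for the demonstration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece is commented.
PB_ROW = re.compile(r"""
    PB              # the literal characters PB
    \s*             # optional whitespace
    \(              # a literal opening parenthesis
    (?:\d+,\s*)*    # zero or more groups of digits, a comma, optional whitespace
    \d+             # the final number
    \)              # a literal closing parenthesis
""", re.VERBOSE)

# Hypothetical sample: two records joined by a space.
text = "WB (1, 8, 1, 16, 52) PB (2, 11, 8, 3, 4)"
matches = PB_ROW.findall(text)
print(matches)  # ['PB (2, 11, 8, 3, 4)']
```

Because the pattern has no capturing groups, findall returns the whole matched text for each hit.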
I would not bother concatenating the whole file together and using findall
on that though, unless your expression can span multiple lines. I would prefer to still use the approach shown above, because in all but a few cases that I can think of, textual data will generally be delimited by newlines:
pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')
...
        if pattern.match(line):
            ...
Pre-compiling the pattern once saves a little work on every line (the re module caches compiled patterns, so the difference is usually small), but you could call re.match(..., line) every time as well.
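A rough sketch of the per-line approach with a pre-compiled pattern might look like the following. The helper name matching_lines and the sample list are assumptions for illustration; in practice the lines would come from iterating over the open file.

```python
import re

# Compile once, reuse for every line.
PATTERN = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')

def matching_lines(lines):
    """Return the lines that match PATTERN at their start, newline removed."""
    return [line.rstrip('\n') for line in lines if PATTERN.match(line)]

# Hypothetical input, shaped like lines read from the file.
raw = [
    "WB (12, 11, 9, 23, 49)\n",
    "PB (18, 21, 10, 5, 5)\n",
    "PB 18 21\n",  # starts with PB but is not a parenthesised list
    "PB (2, 11, 8, 3, 4)\n",
]
result = matching_lines(raw)
print(result)
```

Unlike the plain startswith check, this rejects lines that begin with PB but do not have the expected parenthesised list of numbers.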
Upvotes: 2
Reputation: 11032
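There are a few things that you are missing. First, there is a space between PB (where the word boundary \b matches) and the opening bracket (, which your pattern does not allow for. Second, parentheses () have a special meaning in a regex: they denote a capturing group. To match a parenthesis literally, you need to escape it. To match the exact pattern you intend, you can use this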
\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)
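As a quick check, the sketch below runs the pattern from the question and the pattern above against the sample data, joined with spaces as in the question's code. The variable names are made up for the demonstration.

```python
import re

# Sample rows from the question, joined the same way the question's code does.
rows = [
    "WB (19, 21, 24, 46, 60)",
    "PB (18, 21, 10, 5, 5)",
    "WB (2, 14, 2, 29, 67)",
    "PB (2, 11, 8, 3, 4)",
]
joined = ' '.join(rows)

# Pattern from the question: requires digits immediately after PB,
# but the data has a space and an opening parenthesis there.
original_hits = re.findall(r'\bPB\b([0-9])\d+', joined)

# Pattern from this answer: allows the whitespace and escapes the parentheses.
fixed_hits = re.findall(r'\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)', joined)

print(original_hits)  # []
print(fixed_hits)     # ['PB (18, 21, 10, 5, 5)', 'PB (2, 11, 8, 3, 4)']
```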
If you only want to match lines with PB, you can search for PB directly.
Upvotes: 0