Reputation: 4028
I have a text file whose contents are:
"000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"
Now I'm trying to use a regular expression to get the first run of digits before '|ROOT ', which here is 000000002.
I tried to use:
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
x = re.findall("^\s*[0-9].(ROOT$)", lines)[0]
print(x)
It does not work. My strategy is to match a string that starts with a number and ends with ROOT, and then take the first match.
Upvotes: 1
Views: 1643
Reputation: 189397
ROOT$ requires the four characters ROOT to sit immediately before the end of the line. findall returns all matches; if you only care about the first one, simply use match or search instead.
import re

with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        # Capture the digits that come right before a literal "|ROOT"
        m = re.match(r'(\d+)\|ROOT', line)
        if m:
            print(m.group(1))
            break
The break causes the loop to terminate as soon as the first match is found. We read one line at a time until we find one that matches, then stop. (This also optimizes the program by not reading lines we do not care about, and by holding no more than one line in memory at a time.) The parentheses in the regex cause the text matched inside them to be captured into group(1).
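To make the difference between the anchors and functions concrete, here is a minimal sketch that runs against the sample string from the question instead of a file. The variable names (data, broken, first, first_via_search) and the "junk " prefix are only illustrative assumptions.

```python
import re

# Sample data copied from the question.
data = "000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"

# The pattern from the question: one digit, one arbitrary character,
# then ROOT right at the end of the string. Nothing in the data fits that.
broken = re.findall(r"^\s*[0-9].(ROOT$)", data)

# re.match only tries at the start of the string; the parentheses
# capture the digits into group(1).
m = re.match(r"(\d+)\|ROOT", data)
first = m.group(1) if m else None

# re.search scans forward, so it still finds the first hit when the
# string does not begin with the digits (hypothetical "junk " prefix).
s = re.search(r"(\d+)\|ROOT", "junk " + data)
first_via_search = s.group(1) if s else None

print(broken)
print(first)
print(first_via_search)
```

Checking the result against None before calling group(1) avoids an AttributeError when nothing matches.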
Upvotes: 1
Reputation: 2162
Check out this code:
import re
# 000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|
file = './file.txt'
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
x = re.findall(r"(\d*[0-9])\|ROOT", lines)
print(x)
x = re.findall(r"(\d*[0-9])\|ROOT", lines)[0]
print(x)
Output:
['000000002', '000000003', '000000004']
000000002
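One caveat: indexing the findall result with [0] raises IndexError when there is no match at all. A possible variation, sketched here with a hypothetical helper named first_root_id, takes only the first match lazily and returns None otherwise. It works on a string rather than a file to stay self-contained.

```python
import re

# Sample data copied from the question.
data = "000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"

# Compile once; \d+ is equivalent to the \d*[0-9] used above.
pattern = re.compile(r"(\d+)\|ROOT")


def first_root_id(text):
    """Return the digits before the first '|ROOT', or None if absent."""
    # finditer is lazy, so next() stops after the first match
    # instead of building the full list of matches.
    match = next(pattern.finditer(text), None)
    return match.group(1) if match else None


print(first_root_id(data))
print(first_root_id("no match here"))
```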
Upvotes: 1