William

Reputation: 4028

Regular expression to get the first match in a text file

I have a text file with the following contents:

"000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"

I'm trying to use a regular expression to get the first run of digits before '|ROOT ', which here is 000000002.

I tried to use:

with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()  
    x = re.findall("^\s*[0-9].(ROOT$)", lines)[0]

print(x)

This does not work. My strategy is to match a string that starts with a number and ends with ROOT, and then take the first match.

Upvotes: 1

Views: 1643

Answers (2)

tripleee

Reputation: 189397

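ROOT$ requires the four characters ROOT to sit right at the end of the input, which never happens in your data. In addition, [0-9]. matches only a single digit followed by one arbitrary character, not the whole number. findall returns all matches; if you only care about the first, use match or search instead.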

import re

with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        # One or more digits, captured, followed by a literal '|ROOT'
        m = re.match(r'(\d+)\|ROOT', line)
        if m:
            print(m.group(1))
            break

The break causes the loop to terminate as soon as the first match is found. We read one line at a time until we find one which matches, then stop. (This also avoids reading lines we do not care about, and keeps only one line in memory at a time.) The parentheses in the regex cause the text matched inside them to be captured into group(1). Note that re.match only matches at the beginning of the line; use re.search if the number can appear later in the line.
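To make the difference concrete, here is a minimal sketch that runs the patterns against the sample line from the question instead of a file. The variable names and the `quoted` variant (the same line with one extra leading character) are illustrative assumptions, not part of the original code.

```python
import re

# Sample line copied from the question.
line = "000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"

# The question's pattern: one digit, one arbitrary character, then ROOT
# at the very end of the input. Nothing matches, so findall returns an
# empty list and indexing it with [0] raises IndexError.
original = re.findall(r"^\s*[0-9].(ROOT$)", line)

# re.match only tries the pattern at position 0 of the string.
first_match = re.match(r"(\d+)\|ROOT", line).group(1)

# Hypothetical variant: the same line with a leading quote character.
quoted = '"' + line
m_quoted = re.match(r"(\d+)\|ROOT", quoted)    # fails: position 0 is '"'
s_quoted = re.search(r"(\d+)\|ROOT", quoted)   # scans forward and succeeds
first_search = s_quoted.group(1)

print(original, first_match, m_quoted, first_search)
```

If the number is always the first thing on the line, match is enough; search is the safer choice when anything might precede it.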

Upvotes: 1

Davinder Singh

Reputation: 2162

Try this code:

import re
# file.txt contains:
# 000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|

file = './file.txt'
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()

# Capture every run of digits that is directly followed by '|ROOT'
x = re.findall(r"(\d+)\|ROOT", lines)
print(x)     # all matches
print(x[0])  # first match only

Output:

['000000002', '000000003', '000000004']
000000002
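Indexing the findall result with [0] raises IndexError when nothing matches. If that can happen with your data, one possible alternative is re.search, which stops at the first match and returns None otherwise. The sketch below assumes the text is already in memory; the helper name `first_root_id` is made up for illustration.

```python
import re

# Shortened sample in the same format as the question's data.
text = "000000002|ROOT |237277309|000000003|ROOT |337277309|"

def first_root_id(text):
    """Return the digits before the first '|ROOT', or None if there are none."""
    m = re.search(r"(\d+)\|ROOT", text)
    return m.group(1) if m else None

found = first_root_id(text)             # first number before '|ROOT'
missing = first_root_id("no ids here")  # no match, so None instead of an error
print(found, missing)
```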

Upvotes: 1
