user623990


Extracting two values from python regex

I've got a file formatted like this:

3 name1
2    name2
1    name3

The space between the number and the name can be one or several spaces, or any number of tabs.

I'm trying to find a way to match this line with a regex and extract the number and the name in a list or tuple.

I could write this in several lines, but I'd rather have one clean line that can both recognize tabs and whitespace and give me my values. I've been unsuccessful in doing that.

Edit: I've tried using re.search('^[\d]+[\s|\t]+.*', line) to match any number of digits, then spaces or tabs, and then anything. But this doesn't work, presumably because I'm not telling it what to extract for me.

Upvotes: 1

Views: 344

Answers (2)

Padraic Cunningham

Reputation: 180411

You don't need a regex at all. Calling str.split with no arguments splits on any run of whitespace (spaces or tabs), so it does not matter whether you have 1 or 21 spaces between the fields:

lines="""3 name1
2    name2
1    name3"""

for line in lines.splitlines():
    num, name = line.split()
    print(num,name)
3 name1
2 name2
1 name3

In a list comp:

print([line.split() for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]

In your own code, replace lines.splitlines() with your file object.
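As a rough sketch of that substitution: the helper below accepts any iterable of lines, so an open file object can be passed directly. The name parse_lines is hypothetical, and skipping short lines and converting the number to int are assumptions about what you want, not part of the original answer.

```python
def parse_lines(lines):
    """Return a list of (number, name) tuples from whitespace-separated lines."""
    pairs = []
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) < 2:
            continue  # assumption: silently skip blank or incomplete lines
        pairs.append((int(fields[0]), fields[1]))
    return pairs
```

With a real file this would be used as `with open(path) as f: pairs = parse_lines(f)`, where path is whatever your file is called.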

Using a regex just to split on whitespace is also noticeably slower than str.split:

In [13]: timeit re.search(r'^(\d+)\s+(.*)', line).groups()
1000000 loops, best of 3: 2.04 µs per loop

In [14]: timeit line.split()
1000000 loops, best of 3: 222 ns per loop

In [15]: re.search(r'^(\d+)\s+(.*)', line).groups()
Out[15]: ('1', 'abc')

In [16]: line.split()
Out[16]: ['1', 'abc']

For this input, split yields the same two values (as a list rather than a tuple) in just over a tenth of the time.
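If you want to repeat the comparison outside IPython, here is a minimal sketch using the standard timeit module. The sample line is borrowed from the other answer and the iteration count is arbitrary; absolute numbers will differ by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

line = '1\t abc'  # assumed sample input: digit, tab, space, name

# Time each approach over the same number of calls.
regex_time = timeit.timeit(
    lambda: re.search(r'^(\d+)\s+(.*)', line).groups(), number=100_000)
split_time = timeit.timeit(lambda: line.split(), number=100_000)

print(f"regex: {regex_time:.4f}s  split: {split_time:.4f}s")
```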

Even if there are more than two values you can split and extract the first two:

lines="""3 name1 foo
2    name2  bar
1    name3 foobar """

print([line.split(None, 2)[:2] for line in lines.splitlines()])
[['3', 'name1'], ['2', 'name2'], ['1', 'name3']]

Upvotes: 3

John1024

Reputation: 113844

All you need to do is add parens around what you want to capture:

>>> line='1\t abc'
>>> re.search(r'^(\d+)\s+(.*)', line).groups()
('1', 'abc')

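Incidentally, notice that the regex you used starts with ^, which matches only at the beginning of the string (or of each line when re.MULTILINE is set). Since re.match only ever matches at the start of the string, match can be used in place of search here, and the ^ can be dropped: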

>>> re.match(r'(\d+)\s+(.*)', line).groups()
('1', 'abc')
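One thing to watch for: re.match returns None when a line does not fit the pattern, so calling .groups() directly would raise AttributeError on such a line. A minimal sketch that guards against this follows; the names PATTERN and parse_line are hypothetical, and the int conversion is an assumption about what you want back.

```python
import re

# Raw string so that \d and \s reach the regex engine unchanged.
PATTERN = re.compile(r'(\d+)\s+(.*)')

def parse_line(line):
    """Return (number, name) for a matching line, or None otherwise."""
    m = PATTERN.match(line)  # match anchors at the start of the string
    if m is None:
        return None  # line did not start with digits followed by whitespace
    number, name = m.groups()
    return int(number), name
```

Note that (.*) captures the rest of the line, so a name containing spaces is kept whole, unlike with split().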

Upvotes: 5
