Reputation: 2495

Python Regex Simple Split - Empty at first index

I have a string that looks like this:

test = '20170125NBCNightlyNews'

I am trying to split it into two parts: the digits and the name. The format will always be [date][show]. The date has no separators and is digits only, in YYYYMMDD order (I don't think that matters).

I am trying to use re. I have a working version:

re.split('(\d+)',test)

Simple enough, this gives me the values I need in a list.

['', '20170125', 'NBCNightlyNews']

However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.

I also tried telling it to match the beginning of the string as well, and got the same results.

>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>

Does anyone have any input as to why this is there / how I can avoid the empty string?

Upvotes: 3

Answers (6)

fyrescyon

Reputation: 21

If the date is always 8 digits long, I would access the substrings directly (without using regex):

>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']

If the length of the date might vary, I would use:

>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
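
One thing to keep in mind with the pattern above: because \d* can match zero digits, it also "succeeds" on a string with no leading date. A minimal sketch of a stricter variant, assuming you want an error in that case (split_date_show and _DATE_SHOW are hypothetical names, not part of the original answer):

```python
import re

# Hypothetical helper: requires at least one leading digit.
# re.DOTALL lets the show name contain newlines, which '.' would otherwise skip.
_DATE_SHOW = re.compile(r'(\d+)(.*)', re.DOTALL)


def split_date_show(text):
    """Return (digits, rest) for a string that starts with digits."""
    m = _DATE_SHOW.match(text)  # match() anchors at the start of the string
    if m is None:
        raise ValueError('expected leading digits: %r' % (text,))
    return m.group(1), m.group(2)


print(split_date_show('20170125NBCNightlyNews'))
```

Using match() instead of search() makes the ^ anchor unnecessary.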

Upvotes: 1

masual

Reputation: 89

From the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.

So if you have:

test = 'test20170125NBCNightlyNews'

The indexes would remain unaffected:

>>> re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
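
To illustrate what "same relative indices" means, here is a small sketch. With a single capturing group, the captured separators always end up at the odd indices of the result, whether or not the string starts with a match. The samples list is just made up for the demonstration:

```python
import re

pattern = re.compile(r'(\d+)')

# Illustrative inputs: one starts with digits, one does not.
samples = ['20170125NBCNightlyNews', 'test20170125NBCNightlyNews']

for s in samples:
    parts = pattern.split(s)
    # With one capturing group, separators sit at indices 1, 3, 5, ...
    separators = parts[1::2]
    print(parts, separators)
```

In both cases the date is parts[1], which is exactly why the leading empty string is kept.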

Upvotes: 2

Gabriel Reiser

Reputation: 402

Why re.split when you can just match and get the groups?...

import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')

result = re.match(pattern, test)
result.groups()[0]  # for the date part
result.groups()[1]  # for the show name

I realize now the intention was to parse the text, not fix the regex usage. I'm with the others: you shouldn't use regex for this simple task when you already know the format won't change and the date is a fixed size and always comes first. Just use string slicing.

Upvotes: 2

TemporalWolf

Reputation: 7952

Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:

test[:8], test[8:]

Will split your strings just fine.
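
If you want a sanity check on top of the slicing, one option is to run the first eight characters through datetime.strptime. This is only a sketch and assumes the date really is always YYYYMMDD; date_part, show and air_date are names picked for the example:

```python
from datetime import datetime

test = '20170125NBCNightlyNews'

# Assumes the date is always exactly 8 characters (YYYYMMDD).
date_part, show = test[:8], test[8:]

# Optional check: strptime raises ValueError if date_part is not a valid date.
air_date = datetime.strptime(date_part, '%Y%m%d').date()

print(date_part, show, air_date)
```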

Upvotes: 4

ssc-hrep3

Reputation: 16089

What you are actually doing with re.split('(^\d+)', test) is splitting your test string on a run of one or more digits. With the ^ anchor, only a run at the very start of the string can match; without it, every run of digits is a separator.

So, if you have

test = '20170125NBCNightlyNews'

This is happening:

 20170125 NBCNightlyNews
 ^^^^^^^^

The string is split into three parts: everything before the number, the number itself, and everything after the number. Since nothing comes before the number here, the first part is the empty string.

Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.

re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']

re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
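
The same comparison can be made on the original string. This sketch shows that the leading empty string comes from the split itself, not from the capturing group; the group only decides whether the separator is kept in the result:

```python
import re

test = '20170125NBCNightlyNews'

# Without a capturing group the separator is dropped, but the empty
# string in front of it is still there.
without_group = re.split(r'\d+', test)

# With a capturing group the separator is kept as its own item.
with_group = re.split(r'(\d+)', test)

print(without_group)
print(with_group)
```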

Upvotes: 3

anubhava

Reputation: 785561

You're getting an empty string at the beginning because your input starts with digits and you're splitting on digits. The empty string is whatever comes before the first run of digits, which here is nothing.

To avoid that you can use filter:

>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
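
Note that filter(None, ...) removes every falsy item, so it also drops an empty string at the end of the list (for example when the input is digits only). A sketch of two alternatives, assuming you may want to strip just the leading one; non_empty and trimmed are names chosen for the example:

```python
import re

test = '20170125NBCNightlyNews'
parts = re.split(r'(\d+)', test)

# Drop all empty strings (same effect as filter(None, ...)).
non_empty = [p for p in parts if p]

# Drop only a leading empty string and keep everything else as is.
trimmed = parts[1:] if parts and parts[0] == '' else parts

print(non_empty)
print(trimmed)
```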

Upvotes: 2
