SantoshGupta7

Reputation: 6197

Using pipes in re.split results in an extra split occurring

I am trying to split a string by |INDEX| and /.

re.split can handle multiple separators, and the pattern uses pipes to separate the alternatives, so literal pipes in a separator need to be escaped.

I tried separating with:

import re

a = 'Tokenized/0003036v1|INDEX|3847.story.json'
re.split(r"/|\|INDEX|\|", a)

However, this resulted in an extra, empty split:

['Tokenized', '0003036v1', '', '3847.story.json']

Why are there four items in the list, including an empty one, instead of three?

Upvotes: 1

Views: 30

Answers (2)

Nick

Reputation: 147206

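You have an error in your regex: there is an extra | before the closing \| of |INDEX|. That unescaped | acts as alternation, so the pattern has three alternatives (/, \|INDEX and \|). The string is split on |INDEX and then again on the | that follows it, which leaves an empty string between the two matches. Change the regex to this: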

re.split( r"/|\|INDEX\|"  , a)
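If the separators are plain literals, one way to avoid this kind of slip is to let re.escape do the escaping and join the pieces with | yourself. This is only a minimal sketch; the `separators` list and the variable names are mine, not from the question.

```python
import re

a = 'Tokenized/0003036v1|INDEX|3847.story.json'

# Literal separators to split on (illustrative name, not from the question).
separators = ['/', '|INDEX|']

# re.escape backslash-escapes regex metacharacters such as "|",
# so each separator is matched literally; the "|" used in join()
# is the only alternation operator in the final pattern.
pattern = '|'.join(re.escape(s) for s in separators)

parts = re.split(pattern, a)
print(parts)  # ['Tokenized', '0003036v1', '3847.story.json']
```

The pattern built this way is equivalent to the hand-written one above, but the escaping is no longer done by hand.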

Upvotes: 1

Surya Tej

Reputation: 1392

Instead of

re.split( r"/|\|INDEX|\|"  , a)

use this:

re.split( r"/|\|INDEX\|"  , a)

# Raising maxsplit one step at a time shows where the extra split happens
>>> re.split(r"/|\|INDEX|\|", a, maxsplit=1)
['Tokenized', '0003036v1|INDEX|3847.story.json']
>>> re.split(r"/|\|INDEX|\|", a, maxsplit=2)
['Tokenized', '0003036v1', '|3847.story.json']
>>> re.split(r"/|\|INDEX|\|", a, maxsplit=3)
['Tokenized', '0003036v1', '', '3847.story.json']
>>> re.split(r"/|\|INDEX\|", a)
['Tokenized', '0003036v1', '3847.story.json']
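Another way to see the same thing is to list what each pattern actually matches with re.findall. This is just a small diagnostic sketch; the names `faulty` and `fixed` are mine.

```python
import re

a = 'Tokenized/0003036v1|INDEX|3847.story.json'

faulty = r"/|\|INDEX|\|"   # three alternatives: "/", "|INDEX", "|"
fixed = r"/|\|INDEX\|"     # two alternatives: "/", "|INDEX|"

# With no capture groups, findall returns the full text of each match,
# i.e. exactly the separators that re.split would cut on.
faulty_matches = re.findall(faulty, a)
fixed_matches = re.findall(fixed, a)

print(faulty_matches)  # ['/', '|INDEX', '|']
print(fixed_matches)   # ['/', '|INDEX|']
```

The faulty pattern finds two adjacent separators, "|INDEX" and "|", and the empty string in the split result is the text between them.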

Upvotes: 1
