Adrian Daniel Culea

Reputation: 181

Python split before a certain character

I have the following string:

BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6

I am trying to split it so that I get back the following dict (or another data structure):

BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/

I can more or less split it when there is only one BUCKET, but not several, like this:

res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect 

Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/

Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]


As per Wiktor Stribiżew's comments, the following regex does the job:

r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Upvotes: 3

Views: 1186

Answers (4)

Wiktor Stribiżew

Reputation: 626689

It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.

That means the easiest way to match these key-value pairs is to match one of the bucket names, then a colon, and then as few characters as possible up to the next bucket name or the end of the string.

You may use

r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

Compile with re.S / re.DOTALL if your values span multiple lines. See the regex demo.

Details:

  • (BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
  • : - a colon
  • (.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
  • (?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.
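
To illustrate the re.S / re.DOTALL remark, here is a minimal sketch. The input string is a made-up example (not from the question) in which the first value contains a line break. Without the flag, `.` cannot cross the newline, so the first record is lost:

```python
import re

# Hypothetical input: the first value contains a line break.
s = "BUCKET1:/dir1/\ndir2/BUCKET2:/dir3/"
pattern = r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"

# Without DOTALL, "." stops at the newline and the first record cannot match.
without_flag = re.findall(pattern, s)
# With DOTALL, "." also matches "\n", so both records are found.
with_flag = re.findall(pattern, s, flags=re.DOTALL)

print(without_flag)  # [('BUCKET2', '/dir3/')]
print(with_flag)     # [('BUCKET1', '/dir1/\ndir2/'), ('BUCKET2', '/dir3/')]
```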

Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):

import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
# => [('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

See the online Python demo.
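
Since the question asks for a dict and the same bucket can occur more than once, a plain dict of bucket -> path would overwrite earlier entries. One possible way around that, sketched here under the assumption that a list of paths per bucket is acceptable (the name `paths_by_bucket` is only illustrative):

```python
import re
from collections import defaultdict

buckets = ["BUCKET1", "BUCKET2"]
alternation = "|".join(re.escape(b) for b in buckets)
rx = re.compile(r"({0}):(.*?)(?=(?:{0})|$)".format(alternation))

s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"

# Group the (bucket, path) pairs so repeated buckets keep all their paths.
paths_by_bucket = defaultdict(list)
for bucket, path in rx.findall(s):
    paths_by_bucket[bucket].append(path)

print(dict(paths_by_bucket))
# {'BUCKET1': ['/dir1/dir2/', '/dir3/dir4/'], 'BUCKET2': ['/dir5/dir6']}
```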

Upvotes: 0

RomanPerekhrest

Reputation: 92854

Use re.findall() function:

import re

s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)

print(result)

The output:

[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
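
One thing to keep in mind: this pattern relies on every path ending with a slash. A small sketch of what seems to happen otherwise, using shortened strings made up for illustration:

```python
import re

pattern = r'(\w+):([^:]+\/)'

# Every path ends with "/": all values come back complete.
with_slash = re.findall(pattern, "BUCKET1:/dir1/dir2/BUCKET2:/dir5/dir6/")
# The last path has no trailing "/": its final segment is cut off.
without_slash = re.findall(pattern, "BUCKET1:/dir1/dir2/BUCKET2:/dir5/dir6")

print(with_slash)     # [('BUCKET1', '/dir1/dir2/'), ('BUCKET2', '/dir5/dir6/')]
print(without_slash)  # [('BUCKET1', '/dir1/dir2/'), ('BUCKET2', '/dir5/')]
```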

Upvotes: 1

Resin Drake

Reputation: 546

I'd recommend learning regex, as the others have suggested. However, if you're looking for an alternative, here's a way of doing it without regex. It also produces the output you're looking for.

string = input("Enter:") #Put your own input here.

tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1,len(tempList)-1,2):
    someTuple = ("BUCKET"+tempList[i],tempList[i+1])
    outputList.append(someTuple)

print(outputList) #Put your own output here.

This will produce:

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]

This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.
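
The snippet above depends on every name starting with "BUCKET". For arbitrary names, such as those in the second input of the question, a regex-free variant could look roughly like this. It is only a sketch: the helper `split_by_buckets` is a hypothetical name, and it assumes the list of bucket names is known in advance and that no name followed by a colon appears inside another name or inside a path.

```python
def split_by_buckets(text, buckets):
    """Return (bucket, value) pairs, using 'NAME:' markers as boundaries."""
    # Collect the position of every "NAME:" marker in the text.
    markers = []
    for name in buckets:
        token = name + ":"
        start = text.find(token)
        while start != -1:
            markers.append((start, name))
            start = text.find(token, start + 1)
    markers.sort()

    # Each value runs from just after its marker to the start of the next one.
    pairs = []
    for idx, (pos, name) in enumerate(markers):
        value_start = pos + len(name) + 1
        value_end = markers[idx + 1][0] if idx + 1 < len(markers) else len(text)
        pairs.append((name, text[value_start:value_end]))
    return pairs


s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
print(split_by_buckets(s, ["ADRIAN", "DANIEL", "ADI_BUCKET", "CULEA"]))
```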

Upvotes: 1

shad0w_wa1k3r

Reputation: 13372

Use regex instead?

import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'

output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)

Which gives

[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
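
Note that re.findall() returns plain tuples, so the group names are not visible in the result. If you want to use the names, re.finditer() with groupdict() is one option. A minimal sketch (the variable `records` is just an illustrative name):

```python
import re

test = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
pattern = re.compile(r"(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)")

# Each match object exposes the named groups as a dict.
records = [m.groupdict() for m in pattern.finditer(test)]
print(records)
# [{'bucket': 'BUCKET1', 'path': '/dir1/dir2/'},
#  {'bucket': 'BUCKET1', 'path': '/dir3/dir4/'},
#  {'bucket': 'BUCKET2', 'path': '/dir5/dir6'}]
```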

Upvotes: 0
