Reputation: 181
I have following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in a way I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)
As per Wiktor Stribiżew comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Upvotes: 3
Views: 1186
Reputation: 626689
It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S
/ re.DOTALL
if your values span across multiple lines. See the regex demo.
Details:
(BUCKET1|BUCKET2)
- capture group one that matches and stores in .group(1)
any of the bucket names:
- a colon(.*?)
- any 0+ chars, as few as possible (as *?
is a lazy quantifier), up to the first occurrence of (but not inlcuding)...(?=(?:BUCKET1|BUCKET2)|$)
- any of the bucket names or end of string.Build it dynamically while escaping bucket names (just to play it safe in case those names contain *
or +
or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
See the online Python demo.
Upvotes: 0
Reputation: 92854
Use re.findall() function:
s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
Upvotes: 1
Reputation: 546
If you're experienced, I'd recommend learning Regex just as the others have suggested. However, if you're looking for an alternative, here's a way of doing such without Regex. It also produces the output you're looking for.
string = input("Enter:") #Put your own input here.
tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1,len(tempList)-1,2):
someTuple = ("BUCKET"+tempList[i],tempList[i+1])
outputList.append(someTuple)
print(outputList) #Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.
Upvotes: 1
Reputation: 13372
Use regex instead?
impore re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
Upvotes: 0