Reputation: 33
I am attempting to parse some logs to get the specific catalog numbers for the items viewed. I have broken out all the necessary fields and am now parsing the referer field to get the catalog id of the page viewed.
The strings are in the following formats:
/catalog/AAA1111111
/catalog/BBB-22222-1/
/catalog/CCC-333333/XXX
http://url/catalog/DDD-44444444
http://url/catalog/EEE-555555555/ZZZ
I am using the following regex to strip out the catalog id:
.*\/catalog\/([^\/]+)
The problem is that I cannot stop the regex from grabbing everything after the next forward slash. Is it too greedy?
The results are:
AAA1111111
BBB-22222-1/
CCC-333333/XXX
DDD-44444444
http:EEE-555555555/ZZZ
I've been banging my head on this one for a couple of hours.
I am just looking for a regex that extracts only the catalog id (the string after catalog/).
Can anyone help guide this old coder in the proper direction?
Many thanks.
Upvotes: 2
Views: 53
Reputation: 9225
Using sed:
cat catalogs | sed -E 's/.*\/catalog\/([^/]+)\/?.*/\1/g'
This results in:
AAA1111111
BBB-22222-1
CCC-333333
DDD-44444444
EEE-555555555
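Note that the only modification is matching the trailing text after the id (the added \/?.* part), so the substitution replaces the whole line with the captured group.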
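For anyone not using sed, the same idea can be sketched in Python. This is only an illustration under my own assumptions: the names CATALOG_RE and strip_to_catalog_id are made up for the example, and the sample lines are the ones from the question.

```python
import re

# Same pattern as the sed command above: consume everything up to
# /catalog/, capture the id, then consume whatever trails it.
# CATALOG_RE and strip_to_catalog_id are hypothetical names.
CATALOG_RE = re.compile(r'.*/catalog/([^/]+)/?.*')

def strip_to_catalog_id(line):
    # Replace the entire matched line with capture group 1.
    # A line without /catalog/ does not match and is returned unchanged.
    return CATALOG_RE.sub(r'\1', line)

samples = [
    '/catalog/AAA1111111',
    '/catalog/BBB-22222-1/',
    '/catalog/CCC-333333/XXX',
    'http://url/catalog/DDD-44444444',
    'http://url/catalog/EEE-555555555/ZZZ',
]

for sample in samples:
    print(strip_to_catalog_id(sample))
```

Because the pattern now covers the whole line, nothing after the id is left behind by the replacement.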
Upvotes: 1
Reputation: 1519
Why use a regex when you can split on "/catalog/", take the last item, then split on "/" and take the first item?
In Python, this could be done like this:
line.split('/catalog/')[-1].split('/')[0]
I just wanted to point out that regular expressions are not the solution to every string-parsing problem. Often, when you are faced with "greedy" parsing, doing a "manual" modification before applying a regex helps.
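One caveat with the one-liner: on a line that does not contain "/catalog/" at all, it silently returns whatever precedes the first "/". A slightly more defensive sketch of the same split idea might look like the following; the function name catalog_id and the choice to return None for non-matching lines are my own assumptions, not something the question asked for.

```python
def catalog_id(line, marker='/catalog/'):
    # rpartition splits on the LAST occurrence of the marker, which
    # mirrors taking [-1] after split('/catalog/').
    _, found, rest = line.rpartition(marker)
    if not found:
        # Marker absent: signal it explicitly instead of guessing.
        return None
    # Keep only the text up to the next slash, if there is one.
    return rest.split('/', 1)[0]

samples = [
    '/catalog/AAA1111111',
    '/catalog/BBB-22222-1/',
    '/catalog/CCC-333333/XXX',
    'http://url/catalog/DDD-44444444',
    'http://url/catalog/EEE-555555555/ZZZ',
]

for sample in samples:
    print(catalog_id(sample))
```

Returning None keeps lines from unrelated pages easy to filter out later in the log processing.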
Upvotes: 0