user3234687

Reputation: 33

Capturing content from a string

I am attempting to parse some logs to get the specific catalog numbers for the items viewed. I have broken out all the necessary fields and am now parsing the referer field to get the catalog id of the page viewed.

The strings are in the following formats:

   /catalog/AAA1111111
   /catalog/BBB-22222-1/
   /catalog/CCC-333333/XXX
   http://url/catalog/DDD-44444444
   http://url/catalog/EEE-555555555/ZZZ

I am using the following regex to strip out the catalog id:

   .*\/catalog\/([^\/]+)

The problem is that I cannot stop the regex from grabbing everything after the next forward slash. Is it too greedy?

The results are:

   AAA1111111
   BBB-22222-1/
   CCC-333333/XXX
   DDD-44444444
   http:EEE-555555555/ZZZ

I've been banging my head on this one for a couple of hours.

I am just looking for a regex that will extract only the catalog id (the string after catalog/).

Can anyone help guide this old coder in the proper direction?

Many thanks.

Upvotes: 2

Views: 53

Answers (2)

waTeim

Reputation: 9225

Using sed:

cat catalogs  | sed -E 's/.*\/catalog\/([^/]+)\/?.*/\1/g'

This results in:

AAA1111111
BBB-22222-1
CCC-333333
DDD-44444444
EEE-555555555

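Note that the only modification is that the pattern now also matches the optional trailing slash and everything after it, so the substitution discards that part.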
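For anyone doing this outside sed, here is a minimal Python sketch of the same idea. It assumes the original pattern was being used in a substitution, which would explain why the text after the id survived: a substitution replaces only the matched part, so anything the pattern does not consume stays in the output. The `lines` list and the helper name `strip_to_id` are made up for illustration.

```python
import re

# Sample inputs copied from the question
lines = [
    "/catalog/AAA1111111",
    "/catalog/BBB-22222-1/",
    "/catalog/CCC-333333/XXX",
    "http://url/catalog/DDD-44444444",
    "http://url/catalog/EEE-555555555/ZZZ",
]

# Pattern from the question: stops matching right after the id
ORIGINAL = re.compile(r".*/catalog/([^/]+)")

# Pattern from the sed command: also consumes an optional slash and the rest
EXTENDED = re.compile(r".*/catalog/([^/]+)/?.*")


def strip_to_id(line):
    """Replace the whole line with the captured catalog id (hypothetical helper)."""
    return EXTENDED.sub(r"\1", line)


ids = [strip_to_id(line) for line in lines]

# With the original pattern the unmatched tail is left in place
leftover = ORIGINAL.sub(r"\1", "/catalog/CCC-333333/XXX")

if __name__ == "__main__":
    print("\n".join(ids))
    print(leftover)
```

If you only need the id rather than a rewritten line, `ORIGINAL.search(line).group(1)` should also work in Python, since the capture group itself never includes a slash.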

Upvotes: 1

Raphaël Braud

Reputation: 1519

Why use a regex when you can split on "/catalog/", take the last item, then split on "/" and take the first item?

In Python, this could be done like this:

line.split('/catalog/')[-1].split('/')[0]

I just wanted to point out that regexes are not the solution to every string-parsing problem. Often, when you are faced with "greedy" parsing, doing a "manual" modification before applying a regex helps.
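One thing to watch with the split approach: if a line does not contain "/catalog/" at all, the first split returns the whole line and you get its first path segment back instead of an error. A rough sketch of a guarded version follows, using `str.rpartition` so a missing marker can be detected. The function name `catalog_id` and the `marker` parameter are my own, not from the question.

```python
def catalog_id(line, marker="/catalog/"):
    """Return the path segment right after the last marker, or None if absent."""
    # rpartition splits on the LAST occurrence, like split(marker)[-1];
    # when the marker is missing, the middle element is an empty string.
    _, found, rest = line.rpartition(marker)
    if not found:
        return None
    # Keep only the text up to the next slash, if there is one
    return rest.split("/", 1)[0]


if __name__ == "__main__":
    for sample in (
        "/catalog/BBB-22222-1/",
        "http://url/catalog/EEE-555555555/ZZZ",
        "http://url/other/page",
    ):
        print(sample, "->", catalog_id(sample))
```

Returning None for non-catalog referers makes it easy to filter those log lines out before counting views.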

Upvotes: 0
