Amara

Reputation: 211

Regex to extract parts of string

I'm setting up some campaign codes which will appear as a query parameter in a URL. I'd like to automate the reporting of these campaign codes and have set them up in such a way that each parameter within the code has a specific set of values, which are recognised in the system via a look up. However, the end part of the string is free text. Here's an example:

socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1

As explained above, parameters 1-7 can each take a number of different values that are already known to the system, so I can just use a contains query to extract each of them and use a look up to get their report-friendly names. However, how can I extract the last part of the string, e.g. mffs201403_sbj1? It is optional, but it will always be free text of variable length and will always appear after the 7th colon.

In addition, is there a way to capture only the mffs201403 bit, given that I always use an underscore to separate the two parts at the end? The first part identifies an individual campaign, whereas the second part identifies a variant of that campaign, if one exists. So I'd like to report on all campaign variants, e.g. mffs201403_sbj1, mffs201403_sbj2, etc., as well as on mffs201403 as a whole.

I've been trying to get my head around regex for the longest time and have been unable to master it, so if anyone can help me with this I'd be extremely grateful.

Upvotes: 0

Views: 480

Answers (3)

TheQ

Reputation: 7027

I'm not sure what language you use, but this works fine in C#:

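using System;
using System.Text.RegularExpressions;

var input = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1";
// Skip 7 colon-terminated segments, then capture the two "_"-separated parts.
var pattern = "^(?:[^:]+:){7}(?<last>(?<part1>[^_]+)_(?<part2>[^_]+))$";
var match = Regex.Match(input, pattern);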

if (match.Success)
{
    Console.WriteLine("Last: {0}", match.Groups["last"].Value);
    Console.WriteLine("Part1: {0}", match.Groups["part1"].Value);
    Console.WriteLine("Part2: {0}", match.Groups["part2"].Value);
}

It outputs:

Last: mffs201403_sbj1
Part1: mffs201403
Part2: sbj1

The regex works by finding "any characters other than :" followed by a :, and repeating this 7 times. It then looks for two runs of "any characters other than _" separated by a _, and puts them in separate named groups so they are easy to extract in code.

If you are using a third-party tool that just takes a regex, I guess this will work better:

^(?:[^:]+:){7}([^_]*)_?([^_]*)$

Subgroups 1 and 2 will contain the two parts of the last variable. This version also handles cases where the last variable is empty (as long as the 7th colon is still there), where it doesn't contain a _, or where the part before or after the _ is empty.
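
To make those edge cases concrete, here is a minimal Python sketch of the same pattern. The asker's language is unknown, so Python is only an assumption, and the helper name `split_tail` is made up for illustration.

```python
import re

# Seven colon-terminated segments, then an optional "campaign_variant" tail.
PATTERN = re.compile(r"^(?:[^:]+:){7}([^_]*)_?([^_]*)$")

def split_tail(code):
    """Return (campaign, variant), or None if the code does not match."""
    match = PATTERN.match(code)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_tail("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"))  # ('mffs201403', 'sbj1')
print(split_tail("socfb:obb:img:beg:rp:lo:mff:mffs201403"))       # ('mffs201403', '')
print(split_tail("socfb:obb:img:beg:rp:lo:mff:"))                 # ('', '')
print(split_tail("socfb:obb:img:beg:rp:lo:mff"))                  # None
```

Note the last case: if the code stops after the 7th parameter with no trailing colon, the pattern does not match at all, so the caller has to treat None as "no campaign text".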

In order to just match the last variable, and nothing else, this regex can be used:

[^:]*$

$ is the end of the string, and we match everything before this that isn't a :.

However, matching something in the middle of the string without also matching the surrounding characters is trickier, and may not be possible at all if your tool's regex flavour has no lookarounds. If you know that the string will never contain a _ anywhere except in the last variable, you could use:

[^:]*_

This works in much the same way, but the match will always include the _.
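
If the regex flavour does support lookahead, the trailing _ can be kept out of the match. The following is only a sketch in Python, not something from the original tool: the pattern and the helper name `campaign_only` are illustrative, and it assumes the free-text segment is actually present and has a non-empty campaign part.

```python
import re

# A run of characters that are neither ":" nor "_", accepted only when it is
# followed by an optional "_variant" and then the end of the string.
CAMPAIGN = re.compile(r"[^:_]+(?=(?:_[^:_]*)?$)")

def campaign_only(code):
    """Return the campaign part of the last segment, without the underscore."""
    match = CAMPAIGN.search(code)
    return match.group(0) if match else None

print(campaign_only("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"))  # mffs201403
print(campaign_only("socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj2"))  # mffs201403
print(campaign_only("socfb:obb:img:beg:rp:lo:mff:mffs201403"))       # mffs201403
```

Because the pattern is not anchored to the 7th colon, a code with no free-text segment would return its 7th parameter (mff here), so that case needs to be checked separately.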

Upvotes: 2

npinti

Reputation: 52185

Something like this should work for you: (\w+:){7}([^_]+)_(\w+).

This regular expression expects 7 groups of word characters, each followed by a colon, and then a final segment that is split in two by an underscore. (\w matches upper case letters, lower case letters, digits and underscores.)

If the last segment does not exist, or contains no underscore, the regular expression will not match. A working example can be found here.

In Java this would translate to:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CampaignCodeDemo
{
    public static void main(String[] args)
    {
        // Groups 2 and 3 hold the campaign and the variant respectively.
        Pattern p = Pattern.compile("(\\w+:){7}([^_]+)_(\\w+)");
        String str1 = "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1";
        String str2 = "socfb:obb:img:beg:rp:lo:mff";

        Matcher m1 = p.matcher(str1);
        if(m1.find())
        {
            System.out.println(m1.group(2));
            System.out.println(m1.group(3));
        }
        else
        {
            System.out.println("No content found for " + str1);
        }

        Matcher m2 = p.matcher(str2);
        if(m2.find())
        {
            System.out.println(m2.group(2));
            System.out.println(m2.group(3));
        }
        else
        {
            System.out.println("No content found for " + str2);
        }
    }
}

Yields:

mffs201403
sbj1
No content found for socfb:obb:img:beg:rp:lo:mff

Upvotes: 0

dshepherd

Reputation: 5407

Not quite a direct answer to your question, but: if this is done within a script, you don't really need a regex. Whichever programming language you're using should have a string-splitting function, which will be easier to use and much more readable.

For example, in Python:

strings = query_parameter.split(":")
final_string = strings[-1]

Then, to split up that string:

campaign = final_string.split("_")[0]
try:
    variant = final_string.split("_")[1]
except IndexError:
    variant = ""
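
One thing the snippets above don't cover is a code with no free-text part at all, where strings[-1] would be the 7th parameter rather than the campaign. A possible way to handle that, sketched as a single function (the name `parse_campaign_code` and the `fixed_count` parameter are made up for illustration):

```python
def parse_campaign_code(code, fixed_count=7):
    """Split a campaign code into its fixed parameters, campaign and variant."""
    # Split on at most fixed_count colons so the free-text tail stays in one piece.
    parts = code.split(":", fixed_count)
    fixed = parts[:fixed_count]
    # The free-text tail only exists if the 7th colon was present.
    tail = parts[fixed_count] if len(parts) > fixed_count else ""
    # partition never raises: a missing "_" simply leaves the variant empty.
    campaign, _, variant = tail.partition("_")
    return fixed, campaign, variant

fixed, campaign, variant = parse_campaign_code(
    "socfb:obb:img:beg:rp:lo:mff:mffs201403_sbj1"
)
print(campaign, variant)  # mffs201403 sbj1
```

Using partition instead of split plus try/except keeps the "no variant" case on the same code path as the normal one.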

Upvotes: 0
