jkeuhlen

Reputation: 4517

Can't get a specific regex to work in Perl

I have a string formatted like:

project-version-project_test-type-other_info-other_info.file_type

I can strip most of the information I need out of this string in most cases. My trouble arises when my version has an extra qualifying character in it (i.e. normally 5 characters but sometimes a 6th is added).

Previously, I was using substrings to remove the excess information and get the 'project_test-type'. Now I need to switch to a regex, mostly to handle that extra version character. I could keep using substrings and change the length depending on whether the extra version character is present, but a regex seems more appropriate here.

I tried using patterns like:

my ($type) = $_ =~ /.*-.*-(.*)-.*/;

But the extra '-' in the 'project_test-type' means I can't simply space my regex using that character.
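To make the failure concrete, here is a minimal sketch run against a made-up file name in the layout above (the name and its field values are assumptions, not real data). Because every .* is greedy, the leading ones swallow the first fields and the capture lands on the wrong one:

```perl
use strict;
use warnings;

# Hypothetical file name: project-version-project_test-type-other_info-other_info.file_type
my $name = 'proj-12345-proj_test-unit-info1-info2.txt';

# All three .* are greedy, so the first one runs as far right as it can
# while still leaving three dashes for the rest of the pattern.
my ($type) = $name =~ /.*-.*-(.*)-.*/;

print "$type\n";   # prints "info1", not "proj_test-unit"
```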

What regex can I use to get the 'project_test-type' out of my string?


More information: As a more human readable example, the information is grouped in the following way:

project - version - project_test-type - other_info - other_info . file_type

Upvotes: 1

Views: 128

Answers (4)

mob

Reputation: 118605

Greedy/non-greedy approach

($type) = /.*?-.*?-(.*)-.*-.*/;

.*? is a non-greedy match: it matches any characters, but no more than necessary for the rest of the regular expression to match. The .* between the second and third dashes is a greedy match: it takes as many characters as possible while still letting the rest of the regular expression match, so it also captures any extra dashes inside that field.
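A short sketch of this pattern applied to two made-up file names, one with a 5-character version and one with a 6-character version (both names are assumptions for illustration). It relies on the two trailing fields containing no dashes:

```perl
use strict;
use warnings;

# Hypothetical inputs; only the version length differs.
my @names = (
    'proj-12345-proj_test-unit-info1-info2.txt',
    'proj-12345a-proj_test-unit-info1-info2.txt',
);

my @types;
for my $name (@names) {
    # The two lazy .*? stop at the first and second dash; the greedy (.*)
    # then keeps everything up to the second-to-last dash.
    my ($type) = $name =~ /.*?-.*?-(.*)-.*-.*/;
    push @types, $type;
}

print "$_\n" for @types;   # prints "proj_test-unit" twice
```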

Upvotes: 0

ikegami

Reputation: 385655

Since no field other than the desired one can contain -, any extra - belongs to the desired field.

      +--------------------------- project
      |     +--------------------- version
      |     |   +----------------- project_test-type
      |     |   |      +---------- other_info
      |     |   |      |     +---- other_info.file_type
      |     |   |      |     |
  ____| ____|  _|  ____| ____|
/^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/

[^-] matches a character that's not a -.
[^-]* matches zero or more characters that aren't -.
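As a rough usage sketch, the pattern can be wrapped in a small helper. The name extract_type and the sample file names are hypothetical; the regex is the one shown above. Because the pattern is anchored at both ends, a name with too few fields does not match at all:

```perl
use strict;
use warnings;

# extract_type is a hypothetical helper name used only for this sketch.
# Returns the third field, or undef when the name does not fit the layout.
sub extract_type {
    my ($name) = @_;
    my ($type) = $name =~ /^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/;
    return $type;
}

# Made-up inputs: a type with a dash, a type without one, and a short name.
my $with_dash    = extract_type('proj-12345a-proj_test-unit-info1-info2.txt');
my $without_dash = extract_type('proj-12345-smoke-info1-info2.txt');
my $too_short    = extract_type('proj-12345-info2.txt');

for my $result ($with_dash, $without_dash, $too_short) {
    print defined $result ? "$result\n" : "no match\n";
}
# prints "proj_test-unit", "smoke", "no match"
```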

Upvotes: 5

maraca

Reputation: 8743

To match everything:

/^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/

[] defines character sets and ^ at the beginning of a set means "NOT". Also a - in a set usually means a range, unless it is at the beginning or end. So [^-]+ consumes as many non-dash characters as possible (at least one).
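A minimal sketch of using this pattern to pull every field out in one list assignment. The file name and the variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical file name with a 6-character version.
my $name = 'proj-12345a-proj_test-unit-info1-info2.txt';

# One capture group per field; only the third group may contain dashes.
my ($project, $version, $type, $info_a, $info_b, $ext) =
    $name =~ /^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/;

print join('|', $project, $version, $type, $info_a, $info_b, $ext), "\n";
# prints "proj|12345a|proj_test-unit|info1|info2|txt"
```

If the name does not match, the list assignment leaves all six variables undefined, so a real script would want to check the match result before using them.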

Upvotes: 1

karthik manchala

Reputation: 13640

You can use

/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/

Explanation:

  • \w+\s*- ==> match a sequence of word characters followed by any amount of whitespace and a -
  • \d{5}[a-zA-Z]? ==> exactly 5 digits, optionally followed by one letter
  • (.*?) ==> match everything in a non-greedy way
  • (?=\s*-\s*\d) ==> look ahead for a - followed by a digit and stop there (this assumes the next field is an IP, which starts with a digit)
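The sketch below tries this pattern on three made-up names (all assumptions): a compact one, a spaced one, and one whose following field does not start with a digit. The last case shows the limit of the look-ahead: when no dash is followed by a digit, nothing matches.

```perl
use strict;
use warnings;

# Pattern from the answer, compiled once for reuse.
my $re = qr/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/;

# Hypothetical inputs; 10.0.0.1 stands in for a field that starts with a digit.
my ($compact)  = 'proj-12345a-proj_test-unit-10.0.0.1-info2.txt'         =~ $re;
my ($spaced)   = 'proj - 12345 - proj_test-unit - 10.0.0.1 - info2 . txt' =~ $re;
my ($no_digit) = 'proj-12345-proj_test-unit-info1-info2.txt'             =~ $re;

for my $result ($compact, $spaced, $no_digit) {
    print defined $result ? "$result\n" : "no match\n";
}
# prints "proj_test-unit", "proj_test-unit", "no match"
```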


Upvotes: 0
