Jay

Reputation: 1101

Regex pattern tantrums

I am trying to find the best regular expression pattern to extract a substring from a string.

The string is of the type,

0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

I would like to create a regex that would give me everything after the first period. So in this case, the required substring would be,

Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

I tried

and everything else in between, but I'm just not able to get the result I want. Could someone please point me in the right direction?

Thank you

edit: My apologies. I forgot to mention the programming language I was using. I am using Python and the re module that it comes with.

Upvotes: 0

Views: 87

Answers (5)

Gaijinhunter

Reputation: 14685

There are many ways to do this as you can see above. The way I prefer is:

^[^.]*\.(.*)$
  • ^ means start of the string.
  • [^.]* is zero or more characters that are not "."
  • \. is the period. You need the "\" before it (an unescaped . means any character)
  • (.*) is zero or more characters. The parentheses mark what we want to extract.
  • $ is end of line
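
Since the question was later edited to say that Python's re module is in use, here is a minimal sketch of how this pattern might be applied there. The helper name after_first_period is made up for illustration, and returning None when there is no period is an assumption about the desired behavior.

```python
import re

# Pattern from this answer: skip everything up to the first period,
# then capture whatever follows it.
FIRST_DOT = re.compile(r"^[^.]*\.(.*)$")

def after_first_period(name):
    """Return the text after the first '.', or None if there is no period.

    Hypothetical helper, not part of the re module.
    """
    m = FIRST_DOT.match(name)
    return m.group(1) if m else None

filename = ("0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition."
            "Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf")
print(after_first_period(filename))
```

Because [^.]* allows zero characters, a name that starts with a period (such as ".hidden") still matches and yields "hidden".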

You can test all sorts of methods out on the fly here:

http://www.pythonregex.com/

Upvotes: 0

vim keytar

Reputation: 391

You should certainly read the manual first before posting a question this specific. If you have a Unix-like environment with the Perl documentation installed, this should be your first stop:

perldoc perlre

Alternatively, you can read the documentation online

perl -e '"ab.cd.ef.gh" =~ m/[^.]+\.(.+)/; print $1'

[.]   # Use the square bracket to match a given set of characters. 
[^.]  # Use the caret symbol to invert the matching set. 
[^.]+ # The plus symbol matches one or more of the previous symbol. 
\.    # The escaping backslash and period matches a literal period character
()    # Use parentheses to capture a submatch
(.+)  # Use the period to match any one character and the plus to repeat it one or more times
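
For readers following along in Python (the language named in the question's edit), a rough equivalent of the one-liner could look like the sketch below. It also illustrates why the backslash before the period matters: without it, the period matches any character, so a string with no period at all can still produce a match. The sample strings are only illustrative.

```python
import re

# Escaped period: only a literal "." can separate the two parts.
escaped = re.compile(r"[^.]+\.(.+)")
# Unescaped period: "." matches any single character.
unescaped = re.compile(r"[^.]+.(.+)")

m = escaped.search("ab.cd.ef.gh")
first = m.group(1)
print(first)  # cd.ef.gh

# A string without any period shows the difference between the two.
no_dot_escaped = escaped.search("abc")      # no match, so None
no_dot_unescaped = unescaped.search("abc")  # matches anyway
print(no_dot_escaped)
print(no_dot_unescaped.group(1))  # c
```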

Here's a great tool for building regular expressions:

http://txt2re.com/

Upvotes: 0

Spudley

Reputation: 168655

Simple regex to separate the first part from the rest:

/^.+?\.(.+)$/

Then just grab the content of capturing group 1.

To explain it:

^ and $ match the start and end of the string.

.+? is a non-greedy match for one or more of any character. It is non-greedy (denoted by the question mark) because a greedy .+ would consume as much as possible and only back off to the last dot; this way it stops at the first dot to allow the rest of the expression to match.

\. is a dot character, which is our delimiter.

(.+) is another match for one or more of any character; this time it's greedy because we don't mind, as there's nothing after it anyway. It is wrapped in parentheses to make it a capturing group, so we can extract it from the regex engine.

You haven't specified the language you're working in, but a generic bit of code could look something like this:

var output = input.replace(/^.+?\.(.+)$/,"$1");
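
The snippet above is JavaScript. Since the question's edit names Python, a comparable sketch with re.sub might look like this; note that Python writes the backreference as \1 rather than $1. The greedy variant is included only to show what the question mark changes, and the short sample string is illustrative.

```python
import re

sample = "a.b.c"

# Non-greedy: .+? stops at the first dot, so the group holds everything after it.
lazy = re.sub(r"^.+?\.(.+)$", r"\1", sample)
# Greedy: .+ runs to the last dot, so only the final segment is kept.
greedy = re.sub(r"^.+\.(.+)$", r"\1", sample)

print(lazy)    # b.c
print(greedy)  # c

# When the pattern does not match, re.sub returns the input unchanged.
unchanged = re.sub(r"^.+?\.(.+)$", r"\1", "nodot")
print(unchanged)  # nodot
```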

Hope that helps.

Upvotes: 4

Kaken Bok

Reputation: 3395

^[^\.]+\.(.+)$
  • ^ the start
  • [^\.]+ all chars that are not .
  • \. the first .
  • (.+) the rest
  • $ the end

Upvotes: 2

Ian Boyd

Reputation: 256581

\d+\.(.+)

and replacement is

$1

Documentation is:

  • \d match a digit
  • \d+ match one or more digits
  • \d+\. followed by a "."
  • \d+\..+ followed by anything
  • \d+\.(.+) capture the "anything" chunk

I tested it at RegEx Planet:

Regular Expression: \d+\.(.+)
Replacement: $1
Test String#1: 0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

Result: Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf
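
The $1 replacement syntax above is what that tester expects. In Python's re module, which the question's edit mentions, $1 in a replacement is treated as literal text and the group is written as \1 or \g<1> instead. A minimal sketch, assuming the leading part of the name is always digits as in the question's example:

```python
import re

filename = ("0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition."
            "Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf")

# \g<1> (or \1) refers to the first capturing group in Python replacements.
result = re.sub(r"\d+\.(.+)", r"\g<1>", filename)
print(result)

# "$1" is not a backreference in Python; it is inserted literally.
literal = re.sub(r"\d+\.(.+)", "$1", "12.ab")
print(literal)  # $1

# Without digits before a period the pattern finds nothing to replace.
no_digits = re.sub(r"\d+\.(.+)", r"\g<1>", "abc.def")
print(no_digits)  # abc.def
```

Because this pattern depends on digits before the period, it is narrower than the other answers: a name with a non-numeric prefix passes through unchanged.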

Upvotes: 1
