Reputation: 1101

Regex pattern tantrums

I am trying to find the best regular expression pattern to extract a substring from a string.

The string is of the type,

0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

I would like to create a regex that would give me everything after the first period. So in this case, the required substring would be,

Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

I tried

\w+
\w*
[\w]*

and everything else in between, but I'm just not able to get the result I want. Could someone please point me in the right direction?

Thank you

edit: My apologies. I forgot to mention the programming language I was using. I am using Python and the re module that it comes with.

Upvotes: 0

Answers (5)

Gaijinhunter

Reputation: 14685

There are many ways to do this as you can see above. The way I prefer is:

^[^.]*\.(.*)$

^ means start of the string.
[^.]* is zero or more characters that are not "."
\. is the period. You need to add the "\" before it (. means any character otherwise)
(.*) is zero or more characters. The parentheses mean this is what we want to extract.
$ is the end of the line
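
Since the question mentions Python's re module, the pattern above could be applied roughly like this. This is only a minimal sketch; the variable names are illustrative and not from the original post.

```python
import re

filename = ("0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition."
            "Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf")

# Capture everything after the first period.
match = re.match(r"^[^.]*\.(.*)$", filename)

# re.match returns None when the string contains no period at all.
rest = match.group(1) if match else None
print(rest)
```

Checking for None before calling group(1) avoids an AttributeError on names that contain no period.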

You can test all sorts of methods out on the fly here:

http://www.pythonregex.com/

Upvotes: 0

vim keytar

Reputation: 391

You should certainly read the manual first before posting a question this specific. If you have a Unix-like environment with the Perl documentation installed, this should be your first stop:

perldoc perlre

Alternatively, you can read the documentation online.

perl -e '"ab.cd.ef.gh" =~ m/[^.]+\.(.+)/; print $1'

[.]   # Use the square bracket to match a given set of characters. 
[^.]  # Use the caret symbol to invert the matching set. 
[^.]+ # The plus symbol matches one or more of the previous symbol. 
\.    # The escaping backslash and period matches a literal period character
()    # Use parentheses to capture a submatch
(.+)  # Use the period to match any one character and the plus to match one or more of them
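
To see why the backslash before the period matters, here is a small Python sketch (Python because the question uses it; the variable names are made up for illustration). It compares the escaped and unescaped forms of the pattern on a string with periods and on one without.

```python
import re

escaped = re.compile(r"[^.]+\.(.+)")    # \. matches only a literal period
unescaped = re.compile(r"[^.]+.(.+)")   # a bare . matches any character

# With a period present, both patterns capture the same text.
with_dot_escaped = escaped.search("ab.cd.ef.gh").group(1)
with_dot_unescaped = unescaped.search("ab.cd.ef.gh").group(1)

# Without a period, only the escaped pattern correctly fails to match;
# the unescaped one backtracks and treats an ordinary letter as the separator.
no_dot_escaped = escaped.search("abcd")
no_dot_unescaped = unescaped.search("abcd")

print(with_dot_escaped, with_dot_unescaped)
print(no_dot_escaped, no_dot_unescaped.group(1))
```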

Here's a great tool for building regular expressions:

http://txt2re.com/

Upvotes: 0

Spudley

Reputation: 168803

Simple regex to separate the first part from the rest:

/^.+?\.(.+)$/

Then just grab the content of capturing group 1.

To explain it:

^ and $ match the start and end of the string.

.+? is a non-greedy match for any number of any character. It is non-greedy (denoted by the question mark) because a greedy match would run on to the last dot instead of the first; this way it stops at the first dot to allow the rest of the expression to match.

\. is a dot character, which is our delimiter.

(.+) is another match for any number of any characters; this time it's greedy because we don't mind; there's nothing after it anyway. It is wrapped in brackets to make it a capturing group, so we can extract it from the regex engine.

You haven't specified the language you're working in, but a generic bit of code could look something like this:

var output = input.replace(/^.+?\.(.+)$/,"$1");
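
For Python's re module, a roughly equivalent sketch might look like the following. The function name strip_first_segment is hypothetical; note that Python writes the group reference in the replacement as \1 rather than $1.

```python
import re

def strip_first_segment(name: str) -> str:
    # Replace the whole string with the captured text after the first dot.
    # A string that does not match the pattern is returned unchanged.
    return re.sub(r"^.+?\.(.+)$", r"\1", name)

print(strip_first_segment("0816606366.Univ.of.Minnesota.Pr.pdf"))
```

Because re.sub leaves non-matching input untouched, a name without any dot comes back as-is, which may or may not be what you want.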

Hope that helps.

Upvotes: 4

Kaken Bok

Reputation: 3395

^[^\.]+\.(.+)$

^ start
[^.]+ all not . chars
\. the first .
(.+) the rest
$ the end

Upvotes: 2

Ian Boyd

Reputation: 257009

\d+\.(.+)

and replacement is

$1

Documentation is:

\d match a digit
\d+ match one or more digits
\d+\. followed by a "."
\d+\..+ followed by anything
\d+\.(.+) capture the "anything" chunk

I tested it at RegEx Planet:

Regular Expression: \d+\.(.+)
Replacement: $1
Test String#1: 0816606366.Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf

Result: Univ.of.Minnesota.Pr.Minnesota.Messenia.Expedition.Reconstructing.a.Bronze.Age.Regional.Environment.Jun.1972.pdf
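
The same replace can be sketched in Python, where the group reference is written \1. The sample name is shortened for illustration, and the stricter re.fullmatch variant at the end is an added assumption rather than part of the tested setup above. It is there because the pattern is not anchored, so it also matches a digit run later in a string.

```python
import re

pattern = re.compile(r"\d+\.(.+)")

sample = "0816606366.Univ.of.Minnesota.Pr.pdf"  # shortened illustrative name

# Same replace approach as above, using Python's \1 group reference.
replaced = pattern.sub(r"\1", sample)

# The pattern is unanchored, so digits in the middle of a string match too.
surprise = pattern.sub(r"\1", "abc123.def")

# Stricter variant (an assumption, not from the answer): require the whole
# string to match, and get None otherwise.
strict = pattern.fullmatch("abc123.def")

print(replaced, surprise, strict)
```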

Upvotes: 1
