TheProofIsTrivium

Reputation: 808

Regex: Matching a string starting with anything, then a hyphen

Let's assume I have the following text:

BBC - Here is the text

How would I use a regex to test whether the string starts with "* - "?

Then remove the "* - ", leaving just "Here is the text". (I am using Python.)

I use "*" because it obviously won't start with "BBC - " every time, it might be some other substring.

Would this work?

"^.* - "

Thank you very much.

Answer:

m = re.search(ur'^(.*? [-\xe2\u2014] )?(.*)', text)

This worked. Thank you @xanatos !

Upvotes: 1

Views: 13066

Answers (4)

xanatos

Reputation: 111890

Try this piece of code:

import re

# Python 2 syntax (ur'' literals, print statement), as in the question
text = u"BBC \xe2 abc - Here is the text"
m = re.search(ur'^(.*? [-\xe2] )?(.*)', text, re.UNICODE)

# or, equivalently
# m = re.match(ur'(.*? [-\xe2] )?(.*)', text, re.UNICODE)

# You don't really need re.UNICODE, but if you want to use Unicode
# characters, it's better to consider à a letter :-) , hence re.UNICODE

# group(1) contains the prefix up to and including the separator
if m.group(1) is not None:
    print m.group(1)

# group(2) contains the part after the hyphen, or the whole string
# if there is no hyphen
print m.group(2)

Explanation of the regex:

^ is the beginning of the string (re.match always starts at the beginning
  of the string, so there it is optional)
(...) creates a capturing group (something that ends up in group(...))
(...)? is an optional group
[-\xe2] one character, either - or \xe2 (you can put any number of characters
        in the [], e.g. [abc] means a or b or c)
.*? [-\xe2] (there is a space after the ]) any characters, followed by a space,
      a hyphen (or \xe2) and a space
      the *? means that the * is "lazy", so it will match only the
      minimum possible number of characters; given ABC - DEF - GHI,
      .* - would match ABC - DEF -, while .*? - matches only ABC -

so

(.*? [-\xe2] )? the string may start with any characters followed by a hyphen;
         if it does, that prefix goes in group(1), otherwise group(1) is None
(.*) then comes everything else. You don't need the
     $ (the end of the string, the opposite of ^) because * always
     consumes as many characters as it can (it's a greedy operator)
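
As a quick check of the lazy versus greedy behaviour described above, here is a minimal sketch in Python 3 (the code in this thread is Python 2, so the ur'' prefix becomes r''). The helper name split_prefix and the sample strings are made up for illustration.

```python
import re

# Lazy prefix: stops at the first " - " (or " \xe2 "); the whole group is optional.
LAZY = re.compile(r'^(.*? [-\xe2] )?(.*)')
# Greedy variant, for comparison only: the prefix runs to the last " - ".
GREEDY = re.compile(r'^(.* [-\xe2] )?(.*)')


def split_prefix(pattern, text):
    """Return (prefix, rest); prefix is None when no separator is found."""
    m = pattern.match(text)  # always matches, since (.*) can match nothing
    return m.group(1), m.group(2)


print(split_prefix(LAZY, "ABC - DEF - GHI"))    # ('ABC - ', 'DEF - GHI')
print(split_prefix(GREEDY, "ABC - DEF - GHI"))  # ('ABC - DEF - ', 'GHI')
print(split_prefix(LAZY, "Here is the text"))   # (None, 'Here is the text')
```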

Upvotes: 1

MT.

Reputation: 1925

/^.+-/ should work.

Following are the test cases according to your requirement:

Passes: foo -

Passes: bar-

Passes: -baz-

Passes: *qux-

Passes: -------------

Fails: ****

Fails: -foobar
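
The cases above can be reproduced with a short Python 3 sketch. The slashes are regex delimiters from other languages and are dropped in Python; the names PATTERN, cases and results are only illustrative.

```python
import re

# At least one character of any kind, then a hyphen, anchored at the start.
PATTERN = re.compile(r'^.+-')

cases = ["foo -", "bar-", "-baz-", "*qux-", "-------------", "****", "-foobar"]
results = {s: bool(PATTERN.search(s)) for s in cases}

for s, ok in results.items():
    print("Passes:" if ok else "Fails:", s)

# Removing the match leaves the space that followed the hyphen.
stripped = PATTERN.sub('', "BBC - Here is the text")
print(repr(stripped))  # ' Here is the text'
```

Note that this pattern does not require spaces around the hyphen and, being greedy, matches up to the last hyphen in the string.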

Upvotes: 0

raina77ow

Reputation: 106443

Here's a pattern that matches everything up to and including the first hyphen:

/^[^-]*-\s*/

It reads as follows:

^      - starting from the beginning of the string...
[^-]*  - match any number (including zero) of non-hyphens, then...
-      - match hyphen itself, then...
\s*    - match any number (including zero) of whitespace

Then you can just replace the part of the string matched by the pattern with an empty string; the result of the replacement is probably what you need overall.
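
In Python 3 the replacement step could look like the sketch below, using re.sub. The function name strip_prefix and the sample strings are assumptions made for illustration.

```python
import re


def strip_prefix(text):
    """Remove everything up to the first hyphen, plus any whitespace after it."""
    # count=1 is redundant with the ^ anchor but makes the intent explicit.
    return re.sub(r'^[^-]*-\s*', '', text, count=1)


print(strip_prefix("BBC - Here is the text"))  # Here is the text
print(strip_prefix("No separator here"))       # No separator here
print(strip_prefix("well-known fact"))         # known fact
```

The last call shows a trade-off: the pattern treats any hyphen as the separator, not only one surrounded by spaces.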

Upvotes: 3

arkascha

Reputation: 42935

Use the ?-operator:

'^(.+ [-] )?(.+)$'

Maybe you want to make it a little more flexible with respect to whitespace...

A trivial and crude test script (using PHP instead of Python, sorry for that!):

<?php
$string  = "BBC - This is the text.";
$pattern = '/^(.+ [-] )?(.+)$/';
preg_match($pattern, $string, $tokens);
var_dump($tokens);
?>

Output of the test script:

array(3) {
  [0] =>
  string(23) "BBC - This is the text."
  [1] =>
  string(6) "BBC - "
  [2] =>
  string(17) "This is the text."
}

The first pair of parentheses matches an optional prefix at the beginning of the string: one or more characters followed by a space, a literal hyphen, and another space. This sequence may or may not be present. The second pair of parentheses matches the rest of the string up to the end.
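
Since the question is about Python, here is a rough Python 3 equivalent of the PHP test, as a sketch only. The second sample string is made up to show what happens with more than one separator.

```python
import re

# Same pattern as the PHP example, without the / delimiters.
pattern = re.compile(r'^(.+ [-] )?(.+)$')

first = pattern.match("BBC - This is the text.").groups()
print(first)   # ('BBC - ', 'This is the text.')

# .+ is greedy, so with several separators the prefix runs to the last one.
second = pattern.match("BBC - News - This is the text.").groups()
print(second)  # ('BBC - News - ', 'This is the text.')
```

If only the first " - " should count as the separator, the lazy form .+? in the first group would be the variant to try.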

Upvotes: 0
