Reputation: 590
Imagine this is a part of a large text:
stuff (word1/Word2/w0rd3) stuff, stuff (word4/word5) stuff/stuff (word6) stuff (word7/word8/word9) stuff / stuff, (w0rd10/word11) stuff stuff (word12) stuff (Word13/w0rd14/word15) stuff-stuff stuff (word16/word17).
I want the words. The result must match:
word1
Word2
w0rd3
word4
word5
word6
word7
word8
word9
w0rd10
word11
word12
Word13
w0rd14
word15
word16
word17
Also the result should not be like:
(word1) or (word1/Word2/w0rd3)
Basically no ( or ) or / allowed
What I have tried:
\((\w+)\/(\w+)\/(\w+)\)[^(]*\((\w+)\/(\w+)\)[^(]*\((\w+)\)
This matches those words, but I have to repeat the pattern once for every group of words, which is not clean. I also tried txt2re, but its output is just as repetitive and is not a one-line regex. In case I want to use it in an online regex evaluator where no coding is available, I need a short, one-line regex. My preferred engines are Python and C#.
Update:
I have added some / characters to the text. Also, sorry for changing the accepted answer. All answers are correct in some way, but I have to choose the fastest and most efficient regex here.
Upvotes: 5
Views: 289
Reputation: 18490
A common solution is to check whether there is a closing ) ahead without any opening ( in between.
\w+\b(?=[^)(]*\))
\w+ matches one or more word characters, followed by a \b word boundary
(?=[^)(]*\)) looks ahead for a closing ) with only characters other than ( and ) in between
So this pattern does not check for an opening ( before the word, but often that's not needed.
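As a rough sketch of how this lookahead pattern behaves in Python, the snippet below runs it with re.findall. The sample strings are shortened stand-ins for the question's text, and the variable names are chosen only for illustration.

```python
import re

# Shortened stand-in for the question's text (an assumption, not the full input).
text = "stuff (word1/Word2/w0rd3) stuff, stuff (word4/word5) stuff/stuff (word6) stuff / stuff."

# A word matches only if a closing ")" lies ahead with no "(" or ")" in between.
pattern = re.compile(r"\w+\b(?=[^)(]*\))")

words = pattern.findall(text)
print(words)  # ['word1', 'Word2', 'w0rd3', 'word4', 'word5', 'word6']

# The opening "(" is never checked, so a stray ")" also qualifies the word before it.
stray = pattern.findall("oops) stuff (word1)")
print(stray)  # ['oops', 'word1']
```

The second call illustrates the limitation mentioned above: if unbalanced closing parentheses can occur in the input, a pattern that also checks the opening side may be the safer choice.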
Upvotes: 2
Reputation: 163277
You could use a capturing group, whose contents re.findall returns, and match everything between the parentheses with a forward slash as the delimiter.
Then you could split each result on the forward slash:
\((\w+(?:/\w+)*)\)
Explanation
\( Match an opening parenthesis
( Capturing group
\w+ Match 1+ word chars
(?:/\w+)* Match 0+ times a / and 1+ word chars
) Close capturing group
\) Match a closing parenthesis
If you want to match more than word characters, you might use a negated character class [^()/]+ matching any character except a parenthesis or a forward slash:
\(([^()/]+(?:/[^()/]+)*)\)
For example:
import re
# Capture everything between one pair of parentheses, with / as the delimiter
regex = r"\(([^()/]+(?:/[^()/]+)*)\)"
test_str = "stuff (word1/Word2/w0rd3) stuff, stuff (word4/word5) stuff stuff (word6) stuff (word7/word8/word9) stuff stuff, (w0rd10/word11) stuff stuff (word12) stuff (Word13/w0rd14/word15) stuff-stuff stuff (word16/word17)."
# Split each captured group on /, giving one list of words per pair of parentheses
res = list(map(lambda x: x.split('/'), re.findall(regex, test_str)))
Or see the flattened version.
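One possible way to flatten the nested result is a nested list comprehension. This is only a sketch and not necessarily the linked version. The sample string is a shortened stand-in for the question's text, and the name flat is chosen for illustration.

```python
import re

regex = r"\(([^()/]+(?:/[^()/]+)*)\)"

# Shortened stand-in for the question's text (an assumption).
sample = "stuff (word1/Word2/w0rd3) stuff, stuff (word4/word5) stuff/stuff (word6)."

# For every captured group, split on "/" and collect the words in one flat list.
flat = [word for group in re.findall(regex, sample) for word in group.split("/")]
print(flat)  # ['word1', 'Word2', 'w0rd3', 'word4', 'word5', 'word6']
```

Because the pattern requires both parentheses, the stuff/stuff outside them is ignored.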
Upvotes: 2
Reputation: 10960
Use findall with a look-behind assertion
(?<=[(/])\w+
>>> re.findall(r'(?<=[(/])\w+', input_string)
['word1', 'Word2', 'w0rd3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'w0rd10', 'word11', 'word12', 'Word13', 'w0rd14', 'word15', 'word16', 'word17']
Explanation
(?<=[(/])\w+
Positive lookbehind (?<=[(/]) - asserts that the regex below matches
[(/] - matches a single character present in the list: ( or /
\w+ - matches any word character (equal to [a-zA-Z0-9_])
+ quantifier - matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
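The lookbehind inspects only the single character before each word, so it cannot tell whether a / sits inside parentheses. The sketch below uses two made-up sample strings (assumptions, not the question's full text) to show the difference. That may matter for the updated text, which contains stuff/stuff outside parentheses.

```python
import re

pattern = r"(?<=[(/])\w+"

# Every "/" is inside parentheses: only the wanted words are returned.
inside_only = re.findall(pattern, "stuff (word1/Word2) stuff (word3)")
print(inside_only)  # ['word1', 'Word2', 'word3']

# A "/" outside parentheses also satisfies the lookbehind.
slash_outside = re.findall(pattern, "stuff/stuff (word1/Word2)")
print(slash_outside)  # ['stuff', 'word1', 'Word2']
```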
Upvotes: 1
Reputation: 271050
Instead of matching the words, you can write a regex that matches the non-words, and split by the regex:
\)?[^)]+?\(|\).+|/
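As a sketch of how splitting on this pattern might look in Python (the answer does not name a language, so re.split is an assumption here), the snippet below uses a shortened sample string. The split can leave empty strings at both ends, which are filtered out.

```python
import re

non_word = r"\)?[^)]+?\(|\).+|/"

# Shortened stand-in for the question's text (an assumption).
sample = "stuff (word1/Word2) stuff/stuff (word3) stuff."

# Splitting on the non-word parts leaves the words, plus empty strings at the edges.
pieces = re.split(non_word, sample)
words = [piece for piece in pieces if piece]
print(words)  # ['word1', 'Word2', 'word3']
```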
A non-word is either:
Upvotes: 2