Reputation: 111
I have a long document in which the line I am interested in starts with "Categories :". I want to find all the comma-separated entries that follow "Categories :".
Here's an example line
Categories : Turbo Prop , Very Light , Light , Mid Size
I want to find the start index and end index of Turbo Prop, Very Light, Light, and Mid Size.
I am using the following code:
regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"
matched_text = regex.search(regex_pattern,doc_tex)
But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.
Upvotes: 1
Views: 81
Reputation: 626738
As you are using the PyPi regex module, you can get all captures per group, together with their start and end indices, using
import regex
text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
regex_pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"
m = regex.search(regex_pattern, text)
result = list(zip(m.captures(1),m.starts(1),m.ends(1)))
print(result)
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
See the Python demo
More details from the PyPi regex documentation:
A match object has additional methods which return information on all the successful matches of a repeated capture group. These methods are:
matchobject.captures([group1, ...]) - Returns a list of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).
matchobject.starts([group]) - Returns a list of the start positions. Compare with matchobject.start([group]).
matchobject.ends([group]) - Returns a list of the end positions. Compare with matchobject.end([group]).
matchobject.spans([group]) - Returns a list of the spans. Compare with matchobject.span([group]).
Note I had to revamp your regex a bit:
Categories\s*: - matches Categories, zero or more whitespace chars, and a :
(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+ - one or more repetitions of
    \s* - zero or more whitespace chars
    ([A-Za-z ]+) - Group 1: one or more ASCII letters or spaces
    \b - a word boundary (so, Group 1 value will end with a letter)
    (?:\s*,)? - an optional sequence of zero or more whitespace chars and a comma.
Upvotes: 0
Reputation: 2374
It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group, only the last one; see issue 7132. The regex package, however, adds methods to handle repeated capture groups.
Hence, using the regex package with the matchobject.starts and matchobject.ends methods should work.
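If installing the regex package is not an option, the same result can be reached with the standard re module by avoiding the repeated capture group altogether. The following is only a sketch under assumptions: the item pattern (letters separated by single spaces) and the variable names are illustrative and not taken from the question, and the search is limited to the line that contains the header.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"

# Step 1: find the header; its end marks where the category list begins.
header = re.search(r"Categories\s*:", text)

# Limit the scan to the header's own line (assumption: one list per line).
line_end = text.find("\n", header.end())
if line_end == -1:
    line_end = len(text)

# Step 2: each item is a separate match, so none is overwritten the way
# a repeated capture group is. pos/endpos keep indices relative to text.
item = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")
result = [(m.group(), m.start(), m.end())
          for m in item.finditer(text, header.end(), line_end)]
print(result)
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
```

Because pos and endpos are passed to the compiled pattern's finditer, the reported indices refer to the full text rather than to a slice of it.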
Upvotes: 0
Reputation: 780798
Do it in two steps. First split the line on ":", then split the second part on ",".
category_string = line.split(':')[1]
categories = category_string.split(',')
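The two splits above return the names but not their positions. If the start and end indices are needed as well, they can be rebuilt by tracking a running offset, roughly as sketched here. The variable names (prefix, offset, spans) are illustrative, and the indices are relative to the line, not to the whole document.

```python
line = "Categories : Turbo Prop , Very Light , Light , Mid Size"

# Split once on ':' and remember where the category list starts in the line.
prefix, category_string = line.split(":", 1)
offset = len(prefix) + 1  # +1 for the ':' itself

spans = []
for part in category_string.split(","):
    name = part.strip()
    # Skip the leading whitespace of this part to find where the name starts.
    start = offset + (len(part) - len(part.lstrip()))
    spans.append((name, start, start + len(name)))
    # Move past this part and the ',' that followed it.
    offset += len(part) + 1
print(spans)
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
```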
Upvotes: 1