Tommy

Reputation: 111

Capture all occurrences of substring after specific text regex python

I have a long document in which the line of interest starts with Categories : . I want to find all the comma-separated items that follow Categories : . Here's an example line:

Categories : Turbo Prop , Very Light , Light , Mid Size

I want to find the start index and end index of each of Turbo Prop, Very Light, Light, Mid Size.

I am using the following code:

regex_pattern = r"(?<=Categories : )([A-Za-z ]+(?:,)?)+"

matched_text = regex.search(regex_pattern,doc_tex)

But matched_text.groups() only gives Mid Size. In short, I want to find all occurrences of group 1 after Categories.

Upvotes: 1

Views: 81

Answers (3)

Wiktor Stribiżew

Reputation: 626738

As you are using the PyPI regex module, you can get all captures per group, together with their start and end indices, using:

import regex
text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
regex_pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"
m = regex.search(regex_pattern, text)
result = list(zip(m.captures(1), m.starts(1), m.ends(1)))
print(result)
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]

See the Python demo

More details from the PyPI regex documentation:

A match object has additional methods which return information on all the successful matches of a repeated capture group. These methods are:

  • matchobject.captures([group1, ...])
    • Returns a list of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).
  • matchobject.starts([group])
    • Returns a list of the start positions. Compare with matchobject.start([group]).
  • matchobject.ends([group])
    • Returns a list of the end positions. Compare with matchobject.end([group]).
  • matchobject.spans([group])
    • Returns a list of the spans. Compare with matchobject.span([group]).

Note I had to revamp your regex a bit:

  • Categories\s*: - matches Categories, zero or more whitespace chars, then :
  • (?:\s*([A-Za-z ]+)\b(?:\s*,)?)+ - one or more repetitions of
    • \s* - zero or more whitespace chars
    • ([A-Za-z ]+) - Group 1: one or more ASCII letters or spaces
    • \b - a word boundary (so, Group 1 value will end with a letter)
    • (?:\s*,)? - an optional sequence of zero or more whitespace chars and a comma.

Upvotes: 0

ogdenkev

Reputation: 2374

It looks like the comments answered the OP's question, but for completeness I thought I'd post the answer they discuss. Python's re module does not store all instances of a repeated capture group; see issue 7132. The regex package, however, adds methods to handle repeated capture groups, including:

  • captures - Returns a list of the strings matched in a group or groups.
  • starts - Returns a list of the start positions.
  • ends - Returns a list of the end positions.
  • spans - Returns a list of the spans. Compare with matchobject.span([group]).

Hence, using the regex package with the matchobject.starts and matchobject.ends methods should work.
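
As a quick check of the limitation described above, here is a minimal sketch that uses only the standard-library re module with the pattern from the other answer. The variable names are illustrative. With re, a repeated capture group reports only its final capture, which is why the question's code returned just Mid Size.

```python
import re

text = "Categories : Turbo Prop , Very Light , Light , Mid Size"
pattern = r"Categories\s*:(?:\s*([A-Za-z ]+)\b(?:\s*,)?)+"

m = re.search(pattern, text)

# The group matched four times, but re remembers only the last repetition.
last_capture = m.group(1)
last_span = m.span(1)

print(last_capture, last_span)
# => Mid Size (47, 55)
```

The earlier captures (Turbo Prop, Very Light, Light) are not recoverable from this match object, so with re you need a different approach, while regex exposes them through captures, starts, ends and spans.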

Upvotes: 0

Barmar

Reputation: 780798

Do it in two steps. First split the line using :, then split the second part using ,.

category_string = line.split(':')[1]
categories = category_string.split(',')
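
If you also need the start and end indices the question asks for, the same two-step idea can be sketched with re.finditer from the standard library. This is only a sketch: the helper name category_spans is hypothetical, and it assumes the line starts with Categories : and that no category name contains a comma.

```python
import re

def category_spans(line):
    """Return (name, start, end) for each comma-separated category in line."""
    # Step 1: locate the "Categories :" prefix (assumed to start the line).
    head = re.match(r"Categories\s*:", line)
    if head is None:
        return []
    offset = head.end()
    results = []
    # Step 2: walk over the comma-separated pieces after the prefix.
    for piece in re.finditer(r"[^,]+", line[offset:]):
        raw = piece.group()
        name = raw.strip()
        if not name:
            continue  # skip empty items such as ", ,"
        # Shift the start past any leading whitespace in the piece.
        start = offset + piece.start() + (len(raw) - len(raw.lstrip()))
        results.append((name, start, start + len(name)))
    return results

line = "Categories : Turbo Prop , Very Light , Light , Mid Size"
print(category_spans(line))
# => [('Turbo Prop', 13, 23), ('Very Light', 26, 36), ('Light', 39, 44), ('Mid Size', 47, 55)]
```

The indices are relative to the start of the line, so for a longer document you would add the line's own offset within the document.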

Upvotes: 1
