How to extract a list of strings from a single string using regex?

Question

I have a long string that is actually a set of concepts. I want to mine the string to create a list of concepts.

The string begins with:

Abduction and retroduction Action research: a case study Analysis of variance (ANOVA) Attitudes Autobiography see Biographical method...

The string contains dictionary entries. In the vast majority of cases, a capital letter marks the beginning of a new entry. I want to make a list of the entries.

I have tried re.findall(r"([A-Z].+?)\s[A-Z]"), but it skips every second entry. Instead of ["Abduction and retroduction", "Action research: a case study", "Analysis of variance (ANOVA)"] I get: ["Abduction and retroduction", "Analysis of variance (ANOVA)"]

Casimir et Hippolyte · Accepted Answer

By default, matches cannot overlap, which is why every second entry is skipped: the trailing \s[A-Z] in your pattern consumes the first letter of the next entry, so that entry can no longer start a match. A way to avoid this problem is to check for that first letter without consuming it, using a lookahead assertion (?=...), which means "followed by" (a lookahead is only a check and consumes no characters):

re.findall(r"(\b[A-Z].+?)(?=\s[A-Z]|\s*$)", text)
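To make the difference concrete, here is a minimal sketch that runs both patterns on the excerpt from the question. The variable names (text, entries_consuming, entries_lookahead) are illustrative, and the sample assumes the string ends where the excerpt is cut off, with the trailing "..." dropped.

```python
import re

# Sample taken from the question; the trailing "..." is dropped (assumption).
text = (
    "Abduction and retroduction Action research: a case study "
    "Analysis of variance (ANOVA) Attitudes Autobiography see Biographical method"
)

# Original attempt: \s[A-Z] is part of the match, so the first letter of the
# next entry is consumed and that entry cannot start a new match.
entries_consuming = re.findall(r"([A-Z].+?)\s[A-Z]", text)

# Lookahead version: (?=...) only checks what follows and consumes nothing,
# so the next entry's capital letter is still available to the next match.
entries_lookahead = re.findall(r"(\b[A-Z].+?)(?=\s[A-Z]|\s*$)", text)

print(entries_consuming)
print(entries_lookahead)
```

On this sample the lookahead version returns every entry, but it still splits the cross-reference "Autobiography see Biographical method" into "Autobiography see" and "Biographical method", because a capital letter there does not start a new entry. That is one of the exceptions to the capital-letter rule and would need separate handling.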
