Reputation: 1280
I have a long string that is actually a set of concepts. I want to mine the string and to create a list of concepts.
The string begins with:
Abduction and retroduction Action research: a case study Analysis of variance (ANOVA) Attitudes Autobiography see Biographical method...
The list contains dictionary entries. In the vast majority of cases, a capital letter marks the beginning of a new entry. I want to make a list of entries.
I have tried re.findall(r"([A-Z].+?)\s[A-Z]", text), where text is the string above, but it skips every second entry. Instead of ["Abduction and retroduction", "Action research: a case study", "Analysis of variance (ANOVA)"] I get: ["Abduction and retroduction", "Analysis of variance (ANOVA)"]
Upvotes: 0
Views: 90
Reputation: 89557
By default, matches cannot overlap, which is why every second entry is skipped: the capital letter that starts the next entry is consumed as the end of the previous match. To avoid this, test for that letter without consuming it by using a lookahead assertion (?=...), which means "followed by" (a lookahead is only a check and consumes no characters):
re.findall(r"(\b[A-Z].+?)(?=\s[A-Z]|\s*$)", text)
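For comparison, here is a minimal runnable sketch of both patterns side by side. The variable names (text, consuming, entries) are illustrative, and the sample is only the first few entries quoted in the question, so treat it as a demonstration rather than a tested solution for the full string.

```python
import re

# Sample: the first few entries of the string quoted in the question.
text = (
    "Abduction and retroduction Action research: a case study "
    "Analysis of variance (ANOVA) Attitudes"
)

# Consuming version: the trailing [A-Z] belongs to each match, so the next
# search starts after it and the following entry loses its first letter.
consuming = re.findall(r"([A-Z].+?)\s[A-Z]", text)

# Lookahead version: (?=...) only checks that a capital letter (or the end
# of the string) follows, without consuming it.
entries = re.findall(r"(\b[A-Z].+?)(?=\s[A-Z]|\s*$)", text)

print(consuming)  # ['Abduction and retroduction', 'Analysis of variance (ANOVA)']
print(entries)
```

Note that an entry containing a capitalized word after a space, such as "Autobiography see Biographical method", would still be split in two by this pattern, since the only boundary signal is whitespace followed by a capital letter.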
Upvotes: 1