Reputation: 809
I'd like to find a way to extract the names of U.S. court cases from sentences. They usually follow a predictable pattern, but I think they may be too varied to capture well with regexes, so I was thinking about using NLP to locate them.
Here are a few examples of case names (bolded) as they might be used in partial sentences:
I've been experimenting with off-the-shelf packages (like TextBlob for Python), which help with things like extracting noun phrases, but I don't know how to take the next step and recognize a case name as a single unit.
Upvotes: 2
Views: 154
Reputation: 6978
How about:
((re\.).*?,.*?\b(?<=\s)(?=[a-z]))|(?!\r|\n|\.)((\s\m[A-Z][a-z]+?\M\s).*?v\.\s.*?\b[A-Z].*?[a-z]\M)(?!\s[A-Z])|Ex\sparte\s\b[A-Z].*?[a-z](?=(\.|,|;|\s))
It's imperfect in that it doesn't capture only the bolded text: it might grab a little more. It shouldn't produce false positives, though, since it needs to find the v. It matches all the provided examples, plus the Ex parte cases that I gleaned from Wiki.
There are three top-level alternatives in this regex, in this order:
1. Matches with re.
2. Matches with v.
3. Matches with Ex parte
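P.S.: this is mostly generic PCRE-style pattern syntax, so most programming/scripting languages and many of the more advanced text editors should be able to use it. One caveat: \m and \M are the word-start and word-end escapes from Tcl/PostgreSQL-style regexes, so in engines that lack them you would need to substitute \b.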
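For anyone who wants to try the same idea from Python, here is a minimal sketch. It is not a translation of the regex above (Python's re module does not accept \m and \M); it is a simplified, hypothetical pattern that assumes a party name is a run of capitalised words, optionally joined by small connectors such as "of" or "and". The names WORD, PARTY, CASE_NAME and find_case_names are made up for illustration.

```python
import re
from typing import List

# Assumption: a party name is built from capitalised words such as "Roe" or "O'Brien".
WORD = r"[A-Z][A-Za-z'-]*"

# Capitalised words, optionally joined by lowercase connectors ("Board of Education").
# A connector only counts when another capitalised word follows it.
PARTY = rf"{WORD}(?:(?:\s+(?:of|the|and|&))*\s+{WORD})*"

# Three alternatives, mirroring the three kinds of name discussed above:
# "In re X", "Ex parte X" and "X v. Y".
CASE_NAME = re.compile(
    rf"\b(?:In\s+re\.?\s+{PARTY}|Ex\s+parte\s+{PARTY}|{PARTY}\s+v\.\s+{PARTY})"
)


def find_case_names(text: str) -> List[str]:
    """Return every substring of text that looks like a case name."""
    return [match.group(0) for match in CASE_NAME.finditer(text)]


sample = (
    "The court relied on Brown v. Board of Education and later on "
    "Ex parte Milligan, as well as In re Gault."
)
print(find_case_names(sample))
```

Like the regex above, this can grab a little too much: a capitalised word directly before the first party (for example a sentence that starts with "In Roe v. Wade") is absorbed into the match, so the results would still need some cleanup.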
Upvotes: 1
Reputation: 6039
Illinois Wikifier will get most of these cases for you: http://cogcomp.cs.illinois.edu/demo/wikify/?id=25
Upvotes: 0