steve

Reputation: 431

Extracting/Parsing Pronoun-Pronoun and Verb-Noun/Pronoun Combinations from a Sentence

Problem:
I am trying to extract a list of proper nouns from a job description, such as the following.

text = "Civil, Mechanical, and Industrial Engineering majors are preferred."

I want to extract the following from this text:

Civil Engineering
Mechanical Engineering
Industrial Engineering

This is just one instance of the problem, so a solution that relies on application-specific information will not work. For instance, I cannot keep a list of majors and check whether parts of those names appear in the sentence along with the word "major", since I need this to work for other kinds of sentences as well.

Attempts:
1. I have looked into spaCy dependency parsing, but the parse does not link each engineering type (Civil, Mechanical, Industrial) to the word Engineering; only Industrial is attached to it.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Civil, Mechanical, and Industrial Engineering majors are preferred.")

row_format = "%-15s%-15s%-15s%-15s%-30s"
print(row_format % ("TEXT", "DEP", "HEAD TEXT", "HEAD POS", "CHILDREN"))
for token in doc:
    # Skip punctuation rows to keep the table short.
    if token.text not in (',', '.'):
        children = ','.join(str(c) for c in token.children)
        print(row_format % (token.text, token.dep_, token.head.text, token.head.pos_, children))

...outputting...

TEXT           DEP            HEAD TEXT      HEAD POS       CHILDREN                      
Civil          amod           majors         NOUN           ,,Mechanical                  
Mechanical     conj           Civil          ADJ            ,,and                         
and            cc             Mechanical     PROPN                                        
Industrial     compound       Engineering    PROPN                                        
Engineering    compound       majors         NOUN           Industrial                    
majors         nsubjpass      preferred      VERB           Civil,Engineering             
are            auxpass        preferred      VERB                                         
preferred      ROOT           preferred      VERB           majors,are,.                  

2. I have also tried using nltk POS tagging, but I get the following:

import nltk
nltk.pos_tag( nltk.word_tokenize( 'Civil, Mechanical, and Industrial Engineering majors are preferred.' ) )

[('Civil', 'NNP'),
 (',', ','),
 ('Mechanical', 'NNP'),
 (',', ','),
 ('and', 'CC'),
 ('Industrial', 'NNP'),
 ('Engineering', 'NNP'),
 ('majors', 'NNS'),
 ('are', 'VBP'),
 ('preferred', 'VBN'),
 ('.', '.')]

The types of engineering and the word Engineering all come up as NNP (proper nouns), so any kind of RegexpParser pattern I can think of does not work.

Question:
Does anyone know of a way - in Python 3 - to extract these noun phrase pairings?

EDIT: Additional Examples

The following examples are similar to the first one, except that these are verb-noun / verb-proper noun versions.

text="Experience with testing and automating API’s/GUI’s for desktop and native iOS/Android"

Extract:

testing API’s/GUI’s
automating API’s/GUI’s

text="Design, build, test, deploy and maintain effective test automation solutions"

Extract:

Design test automation solutions
build test automation solutions
test test automation solutions
deploy test automation solutions
maintain test automation solutions

Upvotes: 0

Views: 1385

Answers (1)

Alex

Reputation: 1262

Without any external imports, and assuming that the lists are always comma-separated with an optional "and" before the final item, it's possible to write a regex and do some string manipulation to get the output you want:

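import re

test_string = "Civil, Mechanical, and Industrial Engineering majors are preferred."
# Group 1 is the whole list plus the shared head word; group 4 is the head word alone.
# The space sits inside "(and )?" so that the "and" really is optional.
result = re.search(r"(([A-Z][a-z]+, )+(and )?[A-Z][a-z]+ ([A-Z][a-z]+))+", test_string)
group_type = result.group(4)
# Slice the head word off the end; rstrip(group_type) strips a set of characters, not a suffix.
string_list = result.group(1)[:-len(group_type)].strip()
# Drop a leading "and" as a whole word; strip('and ') would also eat stray a/n/d characters.
items = [re.sub(r"^and\s+", "", i.strip()) + ' ' + group_type for i in string_list.split(',')]

print(items)  # ['Civil Engineering', 'Mechanical Engineering', 'Industrial Engineering']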

Again, this is all based on a narrow assumption about how the lists are formatted. You may need to modify the regex pattern if there are more possibilities.
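
If it helps, here is a slightly more general sketch of the same idea, wrapped in a function. It assumes every item and the shared head word are single capitalized words, and that items are joined by ", ", " and ", or ", and ". The names LIST_PATTERN and expand_shared_head are made up for illustration; this is still a regex heuristic, not a real parser, and it does not cover the verb examples from the edit.

```python
import re

# Hypothetical helper names, introduced only for this sketch.
# Group 1: the coordinated modifiers ("Civil, Mechanical, and Industrial").
# Group 2: the shared head word that follows them ("Engineering").
LIST_PATTERN = re.compile(
    r"((?:[A-Z][a-z]+(?:, and |, | and ))+[A-Z][a-z]+) ([A-Z][a-z]+)"
)


def expand_shared_head(text):
    """Return ['A Head', 'B Head', ...] for the first 'A, B, and C Head' list found."""
    match = LIST_PATTERN.search(text)
    if match is None:
        return []
    modifiers, head = match.groups()
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", modifiers)
    return [f"{part} {head}" for part in parts if part]


print(expand_shared_head("Civil, Mechanical, and Industrial Engineering majors are preferred."))
print(expand_shared_head("Civil and Mechanical Engineering students may apply."))
```

Sentences without such a list return an empty list, so the caller can fall back to something else (for example a dependency parse) when the pattern does not apply.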

Upvotes: 0
