Reputation: 431
Problem:
I am trying to extract a list of proper nouns from a job description, such as the following.
text = "Civil, Mechanical, and Industrial Engineering majors are preferred."
I want to extract the following from this text:
Civil Engineering
Mechanical Engineering
Industrial Engineering
This is one case of the problem, so use of application-specific information will not work. For instance, I cannot have a list of majors and then try to check if parts of the names of those majors are in the sentence along with the word "major" since I need this for other sentences as well.
Attempts:
1. I have looked into spaCy dependency parsing, but parent-child relationships do not show up between each Engineering type (Civil, Mechanical, Industrial) and the word Engineering.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Civil, Mechanical, and Industrial Engineering majors are preferred.")
print("%-15s%-15s%-15s%-15s%-30s" % ("TEXT", "DEP", "HEAD TEXT", "HEAD POS", "CHILDREN"))
for token in doc:
    if token.text not in (',', '.'):
        print("%-15s%-15s%-15s%-15s%-30s" % (
            token.text,
            token.dep_,
            token.head.text,
            token.head.pos_,
            ','.join(str(c) for c in token.children),
        ))
...outputting...
TEXT           DEP            HEAD TEXT      HEAD POS       CHILDREN
Civil          amod           majors         NOUN           ,,Mechanical
Mechanical     conj           Civil          ADJ            ,,and
and            cc             Mechanical     PROPN
Industrial     compound       Engineering    PROPN
Engineering    compound       majors         NOUN           Industrial
majors         nsubjpass      preferred      VERB           Civil,Engineering
are            auxpass        preferred      VERB
preferred      ROOT           preferred      VERB           majors,are,.
2. I have also tried using NLTK POS tagging, but I get the following:
import nltk
nltk.pos_tag(nltk.word_tokenize('Civil, Mechanical, and Industrial Engineering majors are preferred.'))
[('Civil', 'NNP'), (',', ','), ('Mechanical', 'NNP'), (',', ','), ('and', 'CC'), ('Industrial', 'NNP'), ('Engineering', 'NNP'), ('majors', 'NNS'), ('are', 'VBP'), ('preferred', 'VBN'), ('.', '.')]
The types of engineering and the word Engineering all come up as NNP (proper nouns), so any kind of RegexpParser pattern I can think of does not work.
Question:
Does anyone know of a way - in Python 3 - to extract these noun phrase pairings?
EDIT: Additional Examples
The following examples are similar to the first example, except these are verb-noun / verb-propernoun versions.
text="Experience with testing and automating API’s/GUI’s for desktop and native iOS/Android"
Extract:
testing API’s/GUI’s
automation API’s/GUI’s

text="Design, build, test, deploy and maintain effective test automation solutions"
Extract:
Design test automation solutions
build test automation solutions
test test automation solutions
deploy test automation solutions
maintain test automation solutions
Upvotes: 0
Views: 1385
Reputation: 1262
Without any external imports, and assuming the lists are always comma-separated with an optional "and" before the final item, you can get the output you want with a regex and some string manipulation:
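import re

test_string = "Civil, Mechanical, and Industrial Engineering majors are preferred."
# One or more "Word, " items, an optional "and ", the last item, then the shared head word.
result = re.search(r"((?:[A-Z][a-z]+, )+)(?:and )?([A-Z][a-z]+) ([A-Z][a-z]+)", test_string)
group_type = result.group(3)  # shared head word: "Engineering"
# Build the item list from the captured groups. str.strip('and ') removes a set of
# characters rather than a prefix, so it would mangle a word such as "Design".
string_list = [s for s in result.group(1).split(', ') if s] + [result.group(2)]
items = [i + ' ' + group_type for i in string_list]
print(items) # ['Civil Engineering', 'Mechanical Engineering', 'Industrial Engineering']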
Again, this is all based on a narrow assumption about how the lists are formatted. You may need to modify the regex pattern if there are more possibilities.
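As a rough illustration of what "more possibilities" might look like, here is a sketch that also accepts lists without the comma before "and", two-item lists, and "or" as the conjunction. The helper name expand_shared_head and the WORD pattern are my own, not from any library, and it still assumes every item and the shared head are single capitalized words, so it will not cover the lower-case verb examples from the edit.

```python
import re

# Assumption: each list item and the shared head is one capitalized word.
WORD = r"[A-Z][a-z]+"

# Matches, for example:
#   "Civil, Mechanical, and Industrial Engineering"
#   "Civil, Mechanical and Industrial Engineering"
#   "Civil or Industrial Engineering"
PATTERN = re.compile(
    rf"\b((?:{WORD}, )*)"        # group 1: zero or more leading "Word, " items
    rf"({WORD}),? (?:and|or) "   # group 2: the item right before the conjunction
    rf"({WORD}) "                # group 3: the last item
    rf"({WORD})\b"               # group 4: the shared head word
)

def expand_shared_head(text):
    """Return one 'item head' phrase per list item, for every match in text."""
    phrases = []
    for match in PATTERN.finditer(text):
        leading, before_conj, last, head = match.groups()
        # Drop the empty string left behind by the trailing ", " in group 1.
        modifiers = [m for m in leading.split(", ") if m] + [before_conj, last]
        phrases.extend(f"{m} {head}" for m in modifiers)
    return phrases

print(expand_shared_head("Civil, Mechanical, and Industrial Engineering majors are preferred."))
```

Because the pattern only looks at capitalization, it will also fire on any capitalized "A and B C" sequence, so treat the results as candidates to filter rather than as a finished extractor.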
Upvotes: 0