pkushiqiang

Reputation: 19

How to extract information from these sentences

I have a list of sentences like the ones below.

These are sentences I extracted from job descriptions. I want to extract information such as degree type, major, and whether the degree is required or preferred. There are

The result should look like: { degree: Bachelor, major: Computer Science, required: True }

There are no obvious rules in these sentences. How can I achieve this goal?


Bachelor ’ s degree in Computer Science or equivalent
Pursuing B.S. or advanced degree in computer science or related technical/engineering degree .
Bachelor 's Degree in Computer Science or equivalent experience
Youre educated ( BS/MS in Computer Science or other technical degree ) .
•BS in Computer Science , Digital Media or similar technical degree with 3 + years of experience
· Bachelors degree .
Bachelor 's degree in computer science , design or related field
Ability to absorb , master and leverage emerging technologies
BA/BS degree or equivalent practical experience
Education Required : Bachelors Degree
• Bachelor 's degree in related field , OR four ( 4 ) years of experience in a directly related field .

Upvotes: 0

Views: 823

Answers (3)

Akson

Reputation: 691

Another suggestion to do this would be:

  • First: clean up the data by removing punctuation, stop words, unwanted symbols, etc.
  • Second: make a list of the keywords you are interested in.
  • Third: split your data into words (word_tokenize in nltk).
  • Fourth: make a dictionary of the values you are looking for.
  • Fifth: as you read through the word list, look each word up in the dictionary, match it against your keyword list, and append any hits to a new output dictionary.
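
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a complete solution: the keyword dictionaries and stop-word set are hypothetical and tiny, and a plain regex split stands in for nltk's word_tokenize so the snippet has no dependencies. The names `tokenize` and `extract` are made up for this example.

```python
import re

# Hypothetical keyword dictionaries (step 4); extend them for real data.
DEGREE_KEYWORDS = {
    "bachelor": "Bachelor", "bachelors": "Bachelor",
    "bs": "Bachelor", "ba": "Bachelor",
    "masters": "Master", "ms": "Master",
}
MAJOR_KEYWORDS = {
    "computer science": "Computer Science",
    "digital media": "Digital Media",
    "design": "Design",
}
# Hypothetical, deliberately small stop-word set (step 1).
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "s", "the", "to"}


def tokenize(sentence):
    """Steps 1 and 3: lowercase, strip punctuation, drop stop words."""
    text = sentence.lower().replace(".", "")  # "B.S." becomes "bs"
    words = re.split(r"[^a-z0-9]+", text)
    return [w for w in words if w and w not in STOP_WORDS]


def extract(sentence):
    """Step 5: look up single words and adjacent word pairs."""
    tokens = tokenize(sentence)
    candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    result = {"degree": [], "major": []}
    for cand in candidates:
        for field, table in (("degree", DEGREE_KEYWORDS), ("major", MAJOR_KEYWORDS)):
            value = table.get(cand)
            if value and value not in result[field]:
                result[field].append(value)
    return result


print(extract("Youre educated ( BS/MS in Computer Science or other technical degree ) ."))
```

A bare "master" is left out of the degree dictionary on purpose, because it would also match the sample line "Ability to absorb , master and leverage emerging technologies". Pure keyword lookup has many such edge cases, so expect to tune the dictionaries against real data.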

Hope this helps.

Upvotes: 0

Shivamshaz

Reputation: 280

So you are dealing with unstructured data. I hope that by using the following steps you can reach a decent level of accuracy.

  1. Create a lookup table listing all the keywords that may occur for each of your required variables, such as degree, education, etc. You will need to mine various online sources to collect these keywords.
  2. Split your data into sentences or lines and iterate over the list.
  3. While iterating, look for the keywords from your lookup tables to find the useful lines.
  4. Create hierarchical rules to extract the variables accurately, and append them to your result.

Overview of hierarchical rules:

  1. For example, a degree name will be completely alphabetic.
  2. Experience will be alphanumeric.
  3. Terms like "pursuing" will point towards the variable Major.
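
As a rough illustration of the lookup-table and rule idea, here is a small sketch. The tables, the regex, and the function names (`fields_in_line`, `extract_line`) are all assumptions made up for this example, and only the first two rules are covered (alphabetic degree names, alphanumeric experience); the "pursuing" rule is not implemented.

```python
import re

# Hypothetical lookup tables (step 1); real ones would be much larger.
DEGREE_NAMES = {"bachelor", "bachelors", "bs", "ba", "ms"}
LOOKUP = {
    "degree": DEGREE_NAMES | {"degree"},
    "experience": {"experience", "years"},
}

# Rule 2: an experience value mixes digits and words,
# e.g. "3 + years" or "( 4 ) years".
EXPERIENCE_RE = re.compile(r"(\d+)\s*\)?\s*\+?\s*years?", re.IGNORECASE)


def fields_in_line(line):
    """Step 3: report which variables a line seems to mention."""
    words = set(re.findall(r"[a-z]+", line.lower()))
    return {field for field, keywords in LOOKUP.items() if words & keywords}


def extract_line(line):
    """Step 4: apply per-field rules only to lines that matched a table."""
    fields = fields_in_line(line)
    result = {}
    if "degree" in fields:
        # Rule 1: only purely alphabetic tokens can be degree names.
        alphabetic = re.findall(r"[A-Za-z]+", line)
        result["degree"] = [t for t in alphabetic if t.lower() in DEGREE_NAMES]
    if "experience" in fields:
        match = EXPERIENCE_RE.search(line)
        result["years"] = int(match.group(1)) if match else None
    return result
```

Lines that match no table, such as "Ability to absorb , master and leverage emerging technologies", come back as an empty dictionary and can be skipped. Matching on whole words rather than substrings matters here: a substring search for "bs" would wrongly hit "absorb".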

Try to refine these rules on each iteration of your code, and keep adding new ones. This is just the basic approach; I believe that after a few iterations on your methodology you will be able to extract the information.

Upvotes: 1

Daniel

Reputation: 6039

You probably need to gather a list of majors and degrees (for example: http://en.wikipedia.org/wiki/List_of_tagged_degrees ) to extract the degree and major. Then, based on some general rules (or by designing a classifier), decide on "required" or "not required".
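
For the "general rules" option, a sketch could look like the one below. The cue phrases and the `is_required` name are hypothetical, and the default of treating a sentence as required when no cue is present is an assumption rather than something the data guarantees. A trained classifier would replace this function entirely.

```python
import re

# Hypothetical cue phrases; adjust after inspecting real postings.
PREFERRED_RE = re.compile(r"\b(preferred|desired|nice to have|a plus)\b", re.IGNORECASE)
REQUIRED_RE = re.compile(r"\b(required|must|minimum)\b", re.IGNORECASE)


def is_required(sentence, default=True):
    """Return False on a 'preferred' cue, True on a 'required' cue.

    With no cue at all, fall back to `default` (an assumption: job ads
    usually list requirements unless they say otherwise).
    """
    if PREFERRED_RE.search(sentence):
        return False
    if REQUIRED_RE.search(sentence):
        return True
    return default


print(is_required("Education Required : Bachelors Degree"))
```

Passing `default=None` instead lets the caller distinguish "no cue found" from an explicit requirement.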

Upvotes: 0
