Aditya Bhattacharya

Reputation: 1014

Extract information about education institute, grades, year and degree from text using NLP in Python

I want to extract information about education institute, degree, year of passing and grades (CGPA/GPA/Percentage) from text using NLP in Python. For example, if I have the input:

NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE

I want the output:

[{
  "Institute": "NBN Sinhgad School Of Engineering",
  "Degree": "Bachelor of Engineering Computer Science",
  "Grades": "8.78",
  "Year of Passing": "2020"
}, {
  "Institute": "Vidya Bharati Chinmaya Vidyalaya",
  "Degree": "Intermediate-PCM,Economics",
  "Grades": "88.8",
  "Year of Passing": "2016"
}, {
  "Institute": "Vidya Bharati Chinmaya Vidyalaya",
  "Degree": "Matriculation,CBSE",
  "Grades": "8.6",
  "Year of Passing": "2014"
}]

Can it be done without training any custom NER model? Is there any pre-trained NER available to do this?

Upvotes: 0

Views: 633

Answers (1)

Ramesh

Reputation: 585

Yes, it is possible to parse this data without training a custom NER model. You have to build custom rules to parse it.

In your example, you can extract the data with regexes and pattern identification, for instance by relying on the institute always appearing before the year range. If the fields do not come in a fixed order, you have to go by keywords such as school, institute, college and so on. Either way, it depends on your case.

import re

txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''

# extract grades
grade_regex = r'(?:\d{1,2}\.\d{1,2})'
grades = re.findall(grade_regex, txt)

# extract years
year_regex = r'(?:\d{4}\s?-\s?\d{4})'
years = re.findall(year_regex, txt)


# function to replace a value in string
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, ":")
    return string


# extract college
data = replacer(txt, years)
# turn every "<word>:" marker (the replaced year range and the grade label) into a "**" separator
cleaned_text = re.sub(r"(?:\w+\s?:)", "**", data).split('\n')
college = []
degree = []
for i in cleaned_text:
    split_data = i.split("**")
    college.append(split_data[0].replace(',', '').strip())
    degree.append(split_data[1].strip())
parsed_output = []
for i in range(len(grades)):
    parsed_data = {
        "Institute": college[i],
        "Degree": degree[i],
        "Grades": grades[i],
        # take the end year and strip the space left over from "2016 - 2020"
        "Year of Passing": years[i].split('-')[1].strip()
    }
    parsed_output.append(parsed_data)
print(parsed_output)

>>>> [{'Institute': 'NBN Sinhgad School Of Engineering', 'Degree': 'Bachelor of Engineering Computer Science', 'Grades': '8.78', 'Year of Passing': '2020'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Intermediate-PCM,Economics CBSE', 'Grades': '88.8', 'Year of Passing': '2016'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Matriculation,CBSE', 'Grades': '8.6', 'Year of Passing': '2014'}]
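Note that the code above splits on newlines, while the input in the question is a single line. A possible variation, sketched here under assumptions, is to match each record with one regex using named groups, so no line breaks are needed. The names RECORD_RE and parse_education are made up for this illustration, and the pattern assumes every record looks like "institute,city start - end degree label: grade", with a single-word city and a grade label of CGPA, GPA or Percentage. Text that deviates from this layout will need a different pattern.

```python
import re

# Assumed layout of one record (taken from the sample in the question):
#   <institute>,<city> <start year> - <end year> <degree> <grade label>: <grade>
RECORD_RE = re.compile(
    r"(?P<institute>[^,]+?),\s*(?P<city>\w+)\s+"          # institute, then a one-word city
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s+"           # year range, e.g. 2016 - 2020
    r"(?P<degree>.+?)\s+"                                 # everything up to the grade label
    r"(?:CGPA|GPA|Percentage)\s*:\s*(?P<grade>\d{1,3}(?:\.\d{1,2})?)"
)


def parse_education(text):
    """Return one dict per record found in text (hypothetical helper)."""
    records = []
    for m in RECORD_RE.finditer(text):
        records.append({
            "Institute": m.group("institute").strip(),
            "Degree": m.group("degree").strip(),
            "Grades": m.group("grade"),
            "Year of Passing": m.group("end"),
        })
    return records


sample = (
    "NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering "
    "Computer Science CGPA: 8.78 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur "
    "2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8 Vidya Bharati "
    "Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE"
)
print(parse_education(sample))
```

As with the code above, the degree of the second record still comes out as "Intermediate-PCM,Economics CBSE", because nothing in the text separates the board name from the degree. Removing it would need an extra rule, for example a list of known board names.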

Upvotes: 1
