Reputation:
I have a text file with content shown below. I need to identify paragraph headings and create a csv file column heading from each extracted paragraph heading. The text file looks like the text block below. I was thinking of using a rule like:
if (capitalized) and heading_length < 50:
    return heading_text
Is there something in NLTK or another NLP library that could do this more reliably than a rough check of capitalization and length?
This is an old Kaggle competition
DUTIES
A 311 Director is responsible for the successful operation and expansion of the 311 Call Center in the Information Technology Agency (ITA) which answers call from constituents regarding Citywide services provided by City departments; works to ensure the efficient and effective resolution of any issues that may arise; plans, directs, hires, coaches, and coordinates a large staff of professional, technical and clerical employees engaged in the implementation, administration, and operations of the City's 311 Call Center; applies sound supervisor principles and techniques in building and maintaining and effective work force; fulfills equal opportunity responsibilities; and does related work.
REQUIREMENTS
- One year of full-time paid experience as a Senior Management Analyst with the City of Los Angeles or in a class which is at least at the level which provides professional experience in supervisory or managerial work relating to a call center with at least 50 call agents or a call center that receives at least one million calls annually; or
- A Bachelor's degree from a recognized college or university and four years of full-time paid experience in a call center environment with at least 50 call agents or a call center that receives at least one million calls annually, two years of which must be supervising staff working at such a call center; or
- Eight years of full-time paid experience in a call center environment with at least 50 call agents or call center that receives at least one million calls annually, two years of which must be supervising staff working at such a call center.
NOTES:
- In addition to the regular City application, all applicants must complete a 311 Director Qualifications Questionnaire at the time of filing. The 311 Director Qualifications Questionnaire is located within the Qualifications Questions section of the City application. Applicants who fail to complete the Qualifications Questionnaire will not be considered further in this examination, and their application will not be processed.
- Applicants who lack six months or less of the required experience may file for this examination. However, they cannot be appointed until the full experience requirement is met.
- Call center experience related to sales and telemarketing is excluded.
- Customer Relations Management (CRM) systems expertise, including implementation, integration, and knowledge base creation is highly desired.
WHERE TO APPLY
Applications will only be accepted online. When you are viewing the online job bulletin of your choice, simply scroll to the top of the page and select the "Apply" icon. Online job bulletins are also available at http://agency.governmentjobs.com/lacity/default.cfm for Open Competitive Examinations and at http://agency.governmentjobs.com/lacity/default.cfm?promotionaljobs=1 for Promotional Examinations.
NOTE:
Should a large number of qualified candidates file for this examination, an expert review committee may be assembled to evaluate each candidate's qualifications for the position of 311 Director. In this evaluation, the expert review committee will assess each applicant's training and experience based upon the information in the applicant's City employment application and the Qualifications Questionnaire. Those candidates considered by the expert review committee as possessing the greatest likelihood of successfully performing the duties of a 311 Director, based solely on the information presented to the committee, will be invited to participate in the interview.
Upvotes: 1
Views: 3302
Reputation: 15593
There is no generic function in NLTK or other libraries to check whether a piece of text is a document header. While the document you're working with capitalizes headers, that isn't a universal convention.
In your case I would do this:
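for line in text.split('\n'):
    # An all-caps line equals its own uppercase form; skip blank lines, which would also pass.
    is_header = bool(line.strip()) and line.upper() == line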
Your example doesn't have any long all-caps lines, so I don't think you actually need to check length, but you could if you want to. Checking the length first could also make your code faster, since long lines are rejected without the case comparison, though depending on how much text you have it may not matter.
You could learn a statistical model to classify lines into headers and non-headers, but if all your documents look like your example I think the above code is fine.
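If you do want the length check from the question, a minimal sketch might look like the following. The function name `find_headers` and the `max_len` default of 50 are illustrative assumptions, not something NLTK provides. It uses `str.isupper()` rather than comparing against `line.upper()`, so blank lines and lines without any letters are rejected.

```python
def find_headers(text, max_len=50):
    """Return lines that look like headings: all caps and shorter than max_len.

    Hypothetical helper; the 50-character limit comes from the rule
    proposed in the question.
    """
    headers = []
    for line in text.splitlines():
        stripped = line.strip()
        # isupper() is True only when the string has at least one cased
        # character and no lowercase ones, so empty lines are rejected.
        if stripped.isupper() and len(stripped) < max_len:
            headers.append(stripped)
    return headers


sample = (
    "DUTIES\n"
    "A 311 Director is responsible for the call center.\n"
    "NOTES:\n"
    "- Call center experience is required.\n"
)
print(find_headers(sample))
```

On this shortened sample the call should report `DUTIES` and `NOTES:` as headings and skip the body lines.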
Upvotes: 0
Reputation: 1540
You can use regular expressions.
import re

with open('sample.txt') as f:
    text = f.read()

# A heading is a whole line of capital letters and spaces, optionally ending in a colon.
pattern = re.compile(r'^[A-Z][A-Z ]*:?$', re.MULTILINE)
headings = pattern.findall(text)
print(headings)
Output:
['DUTIES', 'REQUIREMENTS', 'NOTES:', 'WHERE TO APPLY', 'NOTE:']
To remove the trailing colon, you can use str.replace(), for example heading.replace(':', '').
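To go from the extracted headings to the CSV columns the question asks for, one possible sketch is below. The function name `sections_to_csv`, the whole-line heading pattern (a stricter variant of the regular expression above), and the choice to join each section's lines with spaces are all assumptions made for illustration. Repeated headings are merged into one column.

```python
import csv
import io
import re

# Assumed heading rule: a whole line of capitals and spaces, optional trailing colon.
HEADING_RE = re.compile(r'[A-Z][A-Z ]*:?')


def sections_to_csv(text, out):
    """Write a header row of headings and one data row of section bodies."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_RE.fullmatch(stripped):
            # Start a new section; drop the trailing colon from the column name.
            current = stripped.rstrip(':')
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    writer = csv.writer(out)
    writer.writerow(sections.keys())
    writer.writerow(' '.join(body) for body in sections.values())
    return list(sections)


sample = "DUTIES\nRuns the call center.\nNOTES:\n- First note.\n- Second note.\n"
buf = io.StringIO()
columns = sections_to_csv(sample, buf)
print(columns)
print(buf.getvalue())
```

Passing an open file (for example one opened with `newline=''`, as the `csv` documentation recommends) instead of the `io.StringIO` buffer would write the same rows to disk.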
Upvotes: 2