Reputation: 1207
This question has a reference (here). I am quite new to Python and keep getting stuck on fairly trivial issues. I have a data series as follows:
Text
0 some texts...qualifications: BE year of passing 2012
1 MCOM from XYZ University in 2007. In 2009 he obtained his MBA
2 Academics: University / Board: XYZ University year of passing:2014
The objective is to extract the first year mentioned in each row, i.e. 2012, 2007, 2014. My approach is as follows:
corpus = pd.Series('the above series')
corpus = corpus.str.replace(r'^[A-Za-z0-9]+')
corpus = corpus.str.lower()
if corpus.str.contains('qualifications').any():
    corpus.str.extract('.*qualifications.*?(\d{4})', expand = False)
if corpus.str.contains('university').any():
    corpus.str.extract('.*university. *?(d\{4})', expand=False)
if corpus.str.contains('academics').any():
    corpus.str.extract('.*academics. *?(d\{4})',expand=False)
The above approach produces a blank series. Kindly help me solve this.
Upvotes: 1
Views: 204
Reputation: 49784
I think you can reduce that to a single expression:
corpus = corpus.str.lower().str.extract(
    r'(university|academics|qualifications).*?(\d{4})', expand=False)
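As for why the original attempt comes back empty, one likely culprit is the pattern itself: in (d\{4}) the backslash is on the brace rather than the d, so the regex looks for the literal text "d{4}" instead of four digits. The results of str.extract are also never assigned to anything. Here is a minimal sketch of the first point using the standard re module on one sample line (the variable names are only for illustration):

```python
import re

# One lowercased sample line from the question
text = "academics: university / board: xyz university year of passing:2014"

# Pattern as written in the question: matches a literal "d{4}", not digits
broken = re.search(r".*university. *?(d\{4})", text)

# Backslash moved onto the d: \d{4} matches four consecutive digits
fixed = re.search(r".*university.*?(\d{4})", text)

print(broken)          # None
print(fixed.group(1))  # 2014
```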
Test code:
import pandas as pd

corpus = pd.Series("""
some texts...qualifications: BE year of passing 2012
MCOM from XYZ University in 2007. In 2009 he obtained his MBA
Academics: University / Board: XYZ University year of passing:2014
""".split('\n')[1:-1], name='Text')
corpus = corpus.str.lower().str.extract(
    r'(university|academics|qualifications).*?(\d{4})', expand=False)
print(corpus)
Results:
                0     1
0  qualifications  2012
1      university  2007
2       academics  2014
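If only the years are wanted, one option is to make the keyword group non-capturing so that the year is the single capture group; with expand=False, str.extract then returns a Series rather than a two-column DataFrame. This is a sketch on the same three sample rows, and the names pattern and years are just placeholders:

```python
import pandas as pd

corpus = pd.Series([
    "some texts...qualifications: BE year of passing 2012",
    "MCOM from XYZ University in 2007. In 2009 he obtained his MBA",
    "Academics: University / Board: XYZ University year of passing:2014",
], name="Text")

# (?:...) groups the keywords without capturing them, so the only
# captured value is the first 4-digit number after a keyword.
pattern = r"(?:university|academics|qualifications).*?(\d{4})"
years = corpus.str.lower().str.extract(pattern, expand=False)

print(years.tolist())  # ['2012', '2007', '2014']
```

The extracted values are strings; rows with no match would come back as NaN, so any conversion to a numeric type would need to account for that.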
Upvotes: 2