Reputation: 11
I have written logic to extract the date ranges of work experiences from a resume. So far it handles experiences written in these formats:
01/2017 - 04/2022
01/07/2017 - 31/07/2017
March 2017 - July 2022
Here is the code:
# Body of the extraction function; its def line and the setup of
# tokens, skills, and skillset are not shown in this post.
# Assumes: import re, and search_dates from dateparser.search.
cur_datespan = None
next_first_date = None
delimiter_count = 0
for ptoken, token in zip(tokens, tokens[1:]):
    token = str(token).lower().strip()
    ptoken = str(ptoken).lower().strip()
    tokenpair = token + " " + ptoken
    # find datespans
    if re.search(r"\d+", token) is not None:
        dates = search_dates(tokenpair, settings={
            'REQUIRE_PARTS': ['month', 'year']}) or []
    else:
        dates = []
    for date in dates:
        if next_first_date is None:
            # first date of a possible span
            next_first_date = date[1]
            delimiter_count = 0
        elif delimiter_count < 6:
            # second date found close enough to the first: close the span
            cur_datespan = (next_first_date, date[1])
            next_first_date = None
        else:
            # too far from the previous date: start a new span
            next_first_date = date[1]
            delimiter_count = 0
    if delimiter_count > 50:
        # no date seen for a while: forget the current span
        next_first_date = None
        cur_datespan = None
    delimiter_count += len(token.split(" "))
    # find skill and add to dict with associated datespan
    if token in skills:
        skillset[cur_datespan].add(token)
    elif (ptoken + " " + token) in skills:
        skillset[cur_datespan].add(ptoken + " " + token)
skilldict = {}
for datespan, span_skills in skillset.items():
    for skill in span_skills:
        if skill not in skilldict:
            skilldict[skill] = []
        if datespan is not None:
            # count months across years, not just the month numbers
            months = ((datespan[1].year - datespan[0].year) * 12
                      + datespan[1].month - datespan[0].month)
            if months > 0:
                skilldict[skill].append(datespan)
return skilldict
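On the final filter: a span such as March 2017 to January 2022 gives a negative result if only the month numbers are subtracted, so the years have to be part of the calculation. A minimal sketch, where `months_between` is a hypothetical helper name and not part of the code above:

```python
from datetime import datetime


def months_between(start: datetime, end: datetime) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Subtracting months alone would give 1 - 3 = -2 for this span.
span = (datetime(2017, 3, 1), datetime(2022, 1, 1))
print(months_between(*span))
```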
However, I could not extract experiences written in formats such as:
March-July 2020
March 2020 - Current/Present
01/07/2017-31/07/2017 (date format "day_first")
2020-2021
From/Since 2020
From March 2020 to July 2022
Upvotes: 1
Views: 2823
Reputation: 1
import datefinder
import re

# text holds the resume contents (its definition is not shown in this post)
pattern = r'(\d{1,2}\s?\d{4})|(\d{4}\s?\d{1,2})|(\d{4})'
# Find all matches in the text
matches = re.findall(pattern, text)
# Extract the matched durations
durations = []
for match in matches:
    for group in match:
        if group:
            durations.append(group)
print(durations)

extracted_dates = []
for item in durations:
    matches = datefinder.find_dates(item)
    # Check if any matches were found
    for match in matches:
        # Extract the month and year from the match
        month = match.month
        year = match.year
        # Append the extracted month and year to the extracted_dates list
        extracted_dates.append((month, year))

unique_years = set(year for month, year in extracted_dates)
for year in unique_years:
    has_month = False
    for month, y in extracted_dates:
        if y == year:
            has_month = True
            print(f"Year: {year}, Month: {month if month else '-'}")
    if not has_month:
        print(f"Year: {year}, Month: -")
output:
Year: 2017, Month: 2
Year: 2019, Month: 4
Year: 2019, Month: 4
Year: 2021, Month: 3
Year: 2021, Month: 2
Year: 2022, Month: 7
Year: 2022, Month: 6
Year: 2013, Month: 12
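The flattening loop is needed because the pattern has three capture groups: re.findall then returns one tuple per match, with empty strings for the groups that did not take part. A small illustration on a made-up sample string (the sample text is an assumption, not the resume used above):

```python
import re

pattern = r'(\d{1,2}\s?\d{4})|(\d{4}\s?\d{1,2})|(\d{4})'
sample = "2019 04 and 2021"  # made-up text for illustration

# One tuple per match; only the group that matched is non-empty.
matches = re.findall(pattern, sample)
print(matches)

# Keep the non-empty group from each tuple.
durations = [group for match in matches for group in match if group]
print(durations)
```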
Upvotes: 0
Reputation: 11
You can use re.findall as follows:
import re
resumes = ['01/2017 - 04/2022',
           '01/07/2017 - 31/07/2017',
           'March 2017 - July 2022',
           'March-July 2020',
           'March 2020 - Current/Present',
           '01/07/2017-31/07/2017',
           '2020-2021',
           'From/Since 2020',
           'From March 2020 to July 2022']
pattern = r'(((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|June?|July?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)|(\d{1,2}\/){0,2})[- ]?\d{4})'
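for resume in resumes:
    res = re.findall(pattern, resume)
    if len(res) > 1:
        # two dates found: report them as a range
        print('from', res[0][0], 'to', res[1][0])
    else:
        # fewer than two dates found: print the line unchanged
        print('->', resume)
Output: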
from 01/2017 to 04/2022
from 01/07/2017 to 31/07/2017
from March 2017 to July 2022
-> March-July 2020
-> March 2020 - Current/Present
from 01/07/2017 to 31/07/2017
from 2020 to -2021
-> From/Since 2020
from March 2020 to July 2022
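The lines marked `->` are the ones this pattern does not split, and `2020-2021` keeps the hyphen on its end date. One possible way to cover them is to treat the separator and the open-ended keywords explicitly. The following is only a sketch under assumptions: `parse_span`, the keyword list (current, present, now, today), and the separators (`-`, `to`, `until`) are my own choices, and it returns the date strings rather than parsed dates.

```python
import re

MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|'
         r'Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b')
# A date: optional month name or dd/mm/ (or mm/) prefix, then a 4-digit year.
DATE = rf'(?:{MONTH}\s+|(?:\d{{1,2}}/){{1,2}})?\d{{4}}'
# Assumed keywords for a job that is still ongoing.
OPEN_END = r'(?:current|present|now|today)(?:/(?:current|present))?'
SEP = r'\s*(?:-|\bto\b|\buntil\b)\s*'

RANGE = re.compile(
    rf'\b(?P<start>{DATE}|{MONTH})'   # start may be a bare month ("March-July 2020")
    rf'{SEP}'
    rf'(?P<end>{DATE}|{OPEN_END})',
    re.IGNORECASE)
SINCE = re.compile(
    rf'\b(?:from|since)(?:/(?:from|since))?\s+(?P<start>{DATE})',
    re.IGNORECASE)


def parse_span(text):
    """Return (start, end) as strings; end is None for an ongoing span."""
    m = RANGE.search(text)
    if m:
        start, end = m.group('start'), m.group('end')
        end_year = re.search(r'\d{4}', end)
        if end_year is None:
            end = None  # "Current" / "Present": no end date yet
        elif not re.search(r'\d{4}', start):
            # bare start month: borrow the year from the end date
            start = f'{start} {end_year.group()}'
        return start, end
    m = SINCE.search(text)
    if m:
        return m.group('start'), None
    return None


for line in ['March-July 2020', 'March 2020 - Current/Present', '2020-2021',
             'From/Since 2020', 'From March 2020 to July 2022']:
    print(line, '->', parse_span(line))
```

The returned strings can then be passed to whichever date parser is already in use. Numeric dates such as 01/07/2017 still need a day-first setting at that stage, since the regex does not decide between day/month and month/day.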
Upvotes: 1