Reputation: 521
I need to extract the date and the location from this string. Is there a more efficient way, one that is also less prone to mistakes? For example, the word in front of the time might not always be "from".
text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')
length = len(text)
# Stop 3 characters early so text[x+3] never runs past the end of the string
for x in range(length - 3):
    if text[x] == 'f':
        if text[x+1] == 'r':
            if text[x+2] == 'o':
                if text[x+3] == 'm':
                    print(text[x:(x+17)])
                    break
# Output: from 3:15-4:00 PM
Upvotes: 0
Views: 129
Reputation: 918
You are not limited to regular expressions when parsing string content.
Instead, you can use the parsing technique described below, which scans the text character by character. It is similar to a technique used in compilers.
To start, look at this example. It finds only the times in the text.
TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
'No registration required. '
TIME_SEPARATORS = ':-'
time_text_start = None
time_text_end = None
time_text = ''
index = 0
for char in TEXT:
    if time_text_start is None:
        if char.isdigit():
            time_text_start = index
    if (time_text_start is not None) and (time_text_end is None):
        if (not char.isdigit()) and (not char.isspace()) and (char not in TIME_SEPARATORS):
            time_text_end = index
            time_text = TEXT[time_text_start: time_text_end].strip()
            print(time_text)
            # Now we will clear our variables to be able to find next time_text data in the text
            time_text_start = None
            time_text_end = None
            time_text = ''
    index += 1
This code prints:
3:15-4:00
7:30
17:30
9:30-11:00
15:00-16:25
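The same scan can be wrapped in a function that returns the matches instead of printing them. This is only a sketch: the name find_times is made up for illustration, and the appended '.' sentinel is an assumption so that a time at the very end of the text is also closed.

```python
TIME_SEPARATORS = ':-'

def find_times(text):
    """Collect runs of digits, spaces and separators that start with a digit."""
    times = []
    start = None
    # Append a sentinel character so a time at the end of the text is not lost
    for index, char in enumerate(text + '.'):
        if start is None:
            if char.isdigit():
                start = index
        elif not (char.isdigit() or char.isspace() or char in TIME_SEPARATORS):
            # First character that cannot belong to a time: close the run
            times.append(text[start:index].strip())
            start = None
    return times

print(find_times('from 3:15-4:00 PM EST, then 17:30'))  # ['3:15-4:00', '17:30']
```

Using enumerate removes the need to maintain the index counter by hand.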
Now look at the real code. It finds all the data you need: time, period, time standard and location.
The location must appear in the text after the time, between the words "in" and "House".
To add more search conditions, modify the find(self, text_to_process) method of the EventsDataFinder class.
To change the formatting (for example, to return the full time range or only the end time), modify the _prepare_event_data(time_text, time_period, time_standard, event_place) method of the EventsDataFinder class.
PS: I understand that classes can be difficult for beginners, so I've tried to keep this code as simple as possible. Without classes, though, the code would be harder to follow, so there are a few.
class TextUnit:
    text = ''
    start = None
    end = None
    absent = False

    def fill_from_text(self, text):
        self.text = text[self.start: self.end].strip()

    def clear(self):
        self.text = ''
        self.start = None
        self.end = None
        self.absent = False


class EventsDataFinder:
    time_standards = {
        'est',
        'utc',
        'dst',
        'edt'
    }
    time_standard_text_len = 3

    period = {
        'am',
        'pm'
    }
    period_text_len = 2

    time_separators = ':-'

    event_place_start_indicator = ' in '
    event_place_end_indicator = ' house'

    fake_text_end = '.'

    def find(self, text_to_process):
        '''
        This method will parse given text and will return list of tuples. Each tuple will contain time of the event
        in the desired format and location of the event.
        :param text_to_process: text to parse
        :return: list of tuples. For example [('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB')]
        '''
        text = text_to_process.replace('\n', '')
        text += self.fake_text_end

        time_text = TextUnit()
        time_period = TextUnit()
        time_standard = TextUnit()
        event_place = TextUnit()
        result_events = list()

        index = -1
        for char in text:
            index += 1

            # Time text
            if time_text.start is None:
                if char.isdigit():
                    time_text.start = index
            if (time_text.start is not None) and (time_text.end is None):
                if (not char.isdigit()) and (not char.isspace()) and (char not in self.time_separators):
                    time_text.end = index
                    time_text.fill_from_text(text)

            # Time period
            # If time_text is already found:
            if (time_text.end is not None) and \
                    (time_period.end is None) and (not time_period.absent) and \
                    (not char.isspace()):
                potential_period = text[index: index + self.period_text_len].lower()
                if potential_period in self.period:
                    time_period.start = index
                    time_period.end = index + self.period_text_len
                    time_period.fill_from_text(text)
                else:
                    time_period.absent = True

            # Time standard
            # If time_period is already found or does not exist:
            if (time_period.absent or ((time_period.end is not None) and (index >= time_period.end))) and \
                    (time_standard.end is None) and (not time_standard.absent) and \
                    (not char.isspace()):
                potential_standard = text[index: index + self.time_standard_text_len].lower()
                if potential_standard in self.time_standards:
                    time_standard.start = index
                    time_standard.end = index + self.time_standard_text_len
                    time_standard.fill_from_text(text)
                else:
                    time_standard.absent = True

            # Event place
            # If time_standard is already found or does not exist:
            if (time_standard.absent or ((time_standard.end is not None) and (index >= time_standard.end))) and \
                    (event_place.end is None) and (not event_place.absent):
                if self.event_place_end_indicator.startswith(char.lower()):
                    potential_event_place = text[index: index + len(self.event_place_end_indicator)].lower()
                    if potential_event_place == self.event_place_end_indicator:
                        event_place.end = index
                        potential_event_place_start = text.rfind(self.event_place_start_indicator,
                                                                 time_text.end,
                                                                 event_place.end)
                        if potential_event_place_start > 0:
                            event_place.start = potential_event_place_start + len(self.event_place_start_indicator)
                            event_place.fill_from_text(text)
                        else:
                            event_place.absent = True

            # Saving result and clearing temporary data holders
            # If event_place is already found or does not exist:
            if event_place.absent or (event_place.end is not None):
                result_events.append(self._prepare_event_data(time_text,
                                                              time_period,
                                                              time_standard,
                                                              event_place))
                time_text.clear()
                time_period.clear()
                time_standard.clear()
                event_place.clear()

        # This code will save data of the last incomplete event (all that was found). If it exists of course.
        if (time_text.end is not None) and (event_place.end is None):
            result_events.append(self._prepare_event_data(time_text,
                                                          time_period,
                                                          time_standard,
                                                          event_place))

        return result_events

    @staticmethod
    def _prepare_event_data(time_text, time_period, time_standard, event_place):
        '''
        This method will prepare found data to be saved in a desired format
        :param time_text: text of time
        :param time_period: text of period
        :param time_standard: text of time standard
        :param event_place: location of the event
        :return: will return ready to save tuple. For example ('3:15 PM EST', 'AA A AAA')
        '''
        event_time = time_text.text  # '3:15-4:00'
        split_time = event_time.split('-')  # ['3:15', '4:00']
        if 1 < len(split_time):
            # If it was, for example, '3:15-4:00 PM EST' in the text
            start_time = split_time[0].strip()  # '3:15'
            end_time = split_time[1].strip()  # '4:00'
        else:
            # If it was, for example, '3:15 PM EST' in the text
            start_time = event_time  # '3:15'
            end_time = ''  # ''
        period = time_period.text.upper()  # 'PM'
        standard = time_standard.text.upper()  # 'EST'
        event_place = event_place.text  # 'AA A AAA'

        # Removing empty time fields (for example if there is no period or time standard in the text)
        time_data_separated = [start_time, period, standard]
        new_time_data_separated = list()
        for item in time_data_separated:
            if item:
                new_time_data_separated.append(item)
        time_data_separated = new_time_data_separated

        event_time_interval = ' '.join(time_data_separated)
        result = (event_time_interval, event_place)
        return result


TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
'No registration required. '
edf = EventsDataFinder()
print(edf.find(TEXT))
Let's say we have the following text:
Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House.
The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.
The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.
The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.
The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.
No registration required.
Then this code prints:
[('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB'), ('17:30 UTC', 'C CCC C'), ('9:30 AM', 'DDD'), ('15:00', 'EE EE')]
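As noted earlier, the output format is decided in _prepare_event_data. The standalone sketch below shows one way that formatting step could be varied to return the start time, the end time, or the full range. The function name format_event_time and its mode parameter are hypothetical and not part of the class above.

```python
def format_event_time(time_text, period='', standard='', mode='start'):
    """Hypothetical helper: mode is 'start', 'end' or 'full'."""
    parts = [p.strip() for p in time_text.split('-')]
    start_time = parts[0]
    end_time = parts[1] if len(parts) > 1 else ''
    if mode == 'full' and end_time:
        chosen = start_time + '-' + end_time
    elif mode == 'end' and end_time:
        chosen = end_time
    else:
        # Fall back to the start time when there is no range
        chosen = start_time
    # Drop empty fields, as the original method does
    fields = (chosen, period.upper(), standard.upper())
    return ' '.join(item for item in fields if item)

print(format_event_time('3:15-4:00', 'pm', 'est', mode='end'))  # 4:00 PM EST
print(format_event_time('17:30', '', 'utc', mode='full'))       # 17:30 UTC
```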
Upvotes: 0
Reputation: 627419
To extract a starting time from a time range, use the regex:
(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b
See the regex demo
Details:
(?i) - case insensitive matching ON
\b - leading word boundary
(\d{1,2}:\d{2}) - Group 1 capturing 1 or 2 digits, : and 2 digits
(?:-\d{1,2}:\d{2})? - an optional non-capturing group matching 1 or 0 occurrences of:
  - - a hyphen
  \d{1,2} - 1 or 2 digits
  : - a colon
  \d{2} - 2 digits
(\s*[pa]m) - Group 2 capturing a sequence of:
  \s* - 0+ whitespaces
  [pa] - p or a (or P or A)
  m - m or M
\b - a trailing word boundary.
See Python demo:
import re
rx = r"(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b"
s = "Join us for a guided tour of the Campus given by the \nAdmissions staff. The tour will take place from 3:15-4:00 PM EST or from 7:30 AM EST \nand leaves from the Admissions Office in x House. No registration required.' "
matches = ["{}{}".format(x.group(1),x.group(2)) for x in re.finditer(rx, s)]
print(matches)
Since the results are in two separate groups, we need to iterate over all the matches and concatenate the two group values.
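The question also asks for the location, so here is a rough sketch of how the pattern might be extended. It rests on assumptions not stated in the answer: the location is taken to sit between "in" and "House" after the time, as in the sample text, and AM/PM is made optional. The group names and the extract_events helper are made up for illustration.

```python
import re

# Assumption: the location follows the time and sits between "in" and "House".
EVENT_RX = re.compile(
    r"\b(?P<start>\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?"  # start time, optional end time
    r"(?P<period>\s*[pa]m)?\b"                         # optional AM/PM
    r".*?\bin\s+(?P<place>.+?)\s+House\b",             # lazily skip to "in <place> House"
    re.IGNORECASE | re.DOTALL,
)

def extract_events(text):
    """Return (start time, location) pairs; hypothetical helper."""
    events = []
    for m in EVENT_RX.finditer(text):
        period = (m.group("period") or "").strip()
        time_part = " ".join(p for p in (m.group("start"), period) if p)
        events.append((time_part, m.group("place")))
    return events

s = ("The tour will take place from 3:15-4:00 PM EST and leaves "
     "from the Admissions Office in x House. No registration required.")
print(extract_events(s))  # [('3:15 PM', 'x')]
```

Because the lazy `.*?` runs until the next "in ... House", a time with no location after it would be paired with the location of a later event, so this only suits text shaped like the sample.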
Upvotes: 3
Reputation: 375
You could use this regular expression:
r"from [^A-Za-z]+(?:AM|PM)"
It looks for a place in the text that starts with "from" and is followed by no letters other than a closing AM or PM. On the text you provided it returns
from 3:15-4:00 PM
You could use it the following way:
import re
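# .group() returns the matched text; printing the result of re.search directly would show the match object
print(re.search(r"from [^A-Za-z]+(?:AM|PM)", text).group())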
Upvotes: 0