Reputation: 521

In Python, how do you extract certain characters from a string?

I need to extract the date and the location from this string. Is there a more efficient way, one that is also less error-prone? For example, the word in front of the time might not always be "from".

text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')

length = len(text)

for x in range(length):
    if text[x] == 'f':
        if text[x+1] == 'r':
            if text[x+2] == 'o':
                if text[x+3] == 'm':
                    print(text[x:(x+17)])
                    break

Output: from 3:15-4:00 PM

Upvotes: 0

Answers (3)

KromviellBlack

Reputation: 918

You are not limited to regular expressions when parsing string content.

Instead of regular expressions, you can use the parsing technique described below. It is similar to a technique used in compilers.

Simple example of the technique

To start, look at this example. It only finds the times in the text.

TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
       'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
       'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
       'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
       'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
       'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
       'No registration required. '

TIME_SEPARATORS = ':-'

time_text_start = None
time_text_end = None
time_text = ''

index = 0
for char in TEXT:
    if time_text_start is None:
        if char.isdigit():
            time_text_start = index
    if (time_text_start is not None) and (time_text_end is None):
        if (not char.isdigit()) and (not char.isspace()) and (char not in TIME_SEPARATORS):
            time_text_end = index
            time_text = TEXT[time_text_start: time_text_end].strip()

            print(time_text)

            # Now we will clear our variables to be able to find next time_text data in the text
            time_text_start = None
            time_text_end = None
            time_text = ''
    index += 1

This code prints:

3:15-4:00
7:30
17:30
9:30-11:00
15:00-16:25
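The same scan can be wrapped in a reusable function. This is only a sketch: the name find_times and its separators parameter are made up for illustration and are not part of the code above. It uses enumerate instead of a manual index counter and returns the matches instead of printing them.

```python
def find_times(text, separators=':-'):
    """Return every run that starts at a digit and holds only digits, whitespace or separators."""
    found = []
    start = None  # index where the current candidate run began
    for index, char in enumerate(text):
        if start is None:
            if char.isdigit():
                start = index
        elif not (char.isdigit() or char.isspace() or char in separators):
            # The first character that cannot belong to a time ends the run
            found.append(text[start:index].strip())
            start = None
    if start is not None:
        # The text ended while a run was still open
        found.append(text[start:].strip())
    return found


print(find_times('The tour runs from 3:15-4:00 PM EST, then again at 17:30 UTC.'))
```

Like the loop above, this treats any digit as the start of a time, so a number such as a room number would be reported as well.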

Real code

Now look at the real code. It finds all the data you need: time, period, time standard and location.

The location must appear in the text after the time, between the words "in" and "House".

To add more search conditions, modify the find(self, text_to_process) method of the EventsDataFinder class.

To change the formatting (for example, to return the full time range or only the end time), modify the _prepare_event_data(time_text, time_period, time_standard, event_place) method of the EventsDataFinder class.

PS: I understand that classes can be hard for beginners to follow, so I've tried to keep this code as simple as possible. Without classes, though, the code would be even harder to understand, so there are a few.

class TextUnit:
    text = ''
    start = None
    end = None
    absent = False

    def fill_from_text(self, text):
        self.text = text[self.start: self.end].strip()

    def clear(self):
        self.text = ''
        self.start = None
        self.end = None
        self.absent = False


class EventsDataFinder:
    time_standards = {
        'est',
        'utc',
        'dst',
        'edt'
    }
    time_standard_text_len = 3

    period = {
        'am',
        'pm'
    }
    period_text_len = 2

    time_separators = ':-'

    event_place_start_indicator = ' in '
    event_place_end_indicator = ' house'

    fake_text_end = '.'

    def find(self, text_to_process):
        '''
        This method will parse given text and will return list of tuples. Each tuple will contain time of the event
        in the desired format and location of the event.
        :param text_to_process: text to parse
        :return: list of tuples. For example [('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB')]
        '''
        text = text_to_process.replace('\n', '')
        text += self.fake_text_end

        time_text = TextUnit()
        time_period = TextUnit()
        time_standard = TextUnit()
        event_place = TextUnit()

        result_events = list()

        index = -1
        for char in text:
            index += 1

            # Time text
            if time_text.start is None:
                if char.isdigit():
                    time_text.start = index
            if (time_text.start is not None) and (time_text.end is None):
                if (not char.isdigit()) and (not char.isspace()) and (char not in self.time_separators):
                    time_text.end = index
                    time_text.fill_from_text(text)

            # Time period
            # If time_text is already found:
            if (time_text.end is not None) and \
                    (time_period.end is None) and (not time_period.absent) and \
                    (not char.isspace()):
                potential_period = text[index: index + self.period_text_len].lower()
                if potential_period in self.period:
                    time_period.start = index
                    time_period.end = index + self.period_text_len
                    time_period.fill_from_text(text)
                else:
                    time_period.absent = True

            # Time standard
            # If time_period is already found or does not exist:
            if (time_period.absent or ((time_period.end is not None) and (index >= time_period.end))) and \
                    (time_standard.end is None) and (not time_standard.absent) and \
                    (not char.isspace()):
                potential_standard = text[index: index + self.time_standard_text_len].lower()
                if potential_standard in self.time_standards:
                    time_standard.start = index
                    time_standard.end = index + self.time_standard_text_len
                    time_standard.fill_from_text(text)
                else:
                    time_standard.absent = True

            # Event place
            # If time_standard is already found or does not exist:
            if (time_standard.absent or ((time_standard.end is not None) and (index >= time_standard.end))) and \
                    (event_place.end is None) and (not event_place.absent):
                if self.event_place_end_indicator.startswith(char.lower()):
                    potential_event_place = text[index: index + len(self.event_place_end_indicator)].lower()
                    if potential_event_place == self.event_place_end_indicator:
                        event_place.end = index
                        potential_event_place_start = text.rfind(self.event_place_start_indicator,
                                                                 time_text.end,
                                                                 event_place.end)
                        if potential_event_place_start != -1:  # str.rfind returns -1 when nothing is found
                            event_place.start = potential_event_place_start + len(self.event_place_start_indicator)
                            event_place.fill_from_text(text)
                        else:
                            event_place.absent = True

            # Saving result and clearing temporary data holders
            # If event_place is already found or does not exist:
            if event_place.absent or (event_place.end is not None):
                result_events.append(self._prepare_event_data(time_text,
                                                              time_period,
                                                              time_standard,
                                                              event_place))
                time_text.clear()
                time_period.clear()
                time_standard.clear()
                event_place.clear()

        # This code will save data of the last incomplete event (all that was found). If it exists of course.
        if (time_text.end is not None) and (event_place.end is None):
            result_events.append(self._prepare_event_data(time_text,
                                                          time_period,
                                                          time_standard,
                                                          event_place))

        return result_events

    @staticmethod
    def _prepare_event_data(time_text, time_period, time_standard, event_place):
        '''
        This method will prepare found data to be saved in a desired format
        :param time_text: text of time
        :param time_period: text of period
        :param time_standard: text of time standard
        :param event_place: location of the event
        :return: will return ready to save tuple. For example ('3:15 PM EST', 'AA A AAA')
        '''
        event_time = time_text.text  # '3:15-4:00'
        split_time = event_time.split('-')  # ['3:15', '4:00']
        if 1 < len(split_time):
            # If it was, for example, '3:15-4:00 PM EST' in the text
            start_time = split_time[0].strip()  # '3:15'
            end_time = split_time[1].strip()  # '4:00'
        else:
            # If it was, for example, '3:15 PM EST' in the text
            start_time = event_time  # '3:15'
            end_time = ''  # ''
        period = time_period.text.upper()  # 'PM'
        standard = time_standard.text.upper()  # 'EST'
        event_place = event_place.text  # 'AA A AAA'

        # Removing empty time fields (for example if there is no period or time standard in the text)
        time_data_separated = [start_time, period, standard]
        new_time_data_separated = list()
        for item in time_data_separated:
            if item:
                new_time_data_separated.append(item)
        time_data_separated = new_time_data_separated

        event_time_interval = ' '.join(time_data_separated)
        result = (event_time_interval, event_place)

        return result


TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
       'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
       'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
       'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
       'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
       'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
       'No registration required. '

edf = EventsDataFinder()

print(edf.find(TEXT))

Suppose we have the following text:

Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House.

The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.

The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.

The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.

The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.

No registration required.

Then this code prints:

[('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB'), ('17:30 UTC', 'C CCC C'), ('9:30 AM', 'DDD'), ('15:00', 'EE EE')]
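The formatting change mentioned earlier (returning the full range or only the end time) could look roughly like the sketch below. The function format_event_time and its mode parameter are hypothetical. It takes plain strings rather than TextUnit objects, so it would still need to be adapted before replacing the body of _prepare_event_data.

```python
def format_event_time(time_text, period='', standard='', mode='start'):
    """Format a time such as '3:15-4:00' together with its period and time standard.

    mode is a made-up switch: 'start' keeps the start time (the behaviour of the
    code above), 'end' keeps the end time, and 'full' keeps the whole range.
    """
    start, _, end = time_text.partition('-')
    start, end = start.strip(), end.strip()
    if mode == 'full' and end:
        clock = f'{start}-{end}'
    elif mode == 'end' and end:
        clock = end
    else:
        # Single times have no end part, so every mode falls back to the start
        clock = start
    # Skip empty fields, e.g. when the text has no period or time standard
    parts = (clock, period.upper(), standard.upper())
    return ' '.join(part for part in parts if part)


print(format_event_time('3:15-4:00', 'pm', 'est', mode='full'))
print(format_event_time('3:15-4:00', 'pm', 'est', mode='end'))
```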

Upvotes: 0

Wiktor Stribiżew

Reputation: 627419

To extract a starting time from a time range, use the regex:

(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b

See the regex demo

Details:

(?i) - case insensitive matching ON
\b - leading word boundary
(\d{1,2}:\d{2}) - Group 1 capturing 1 or 2 digits, : and 2 digits
(?:-\d{1,2}:\d{2})? - an optional non-capturing group matching 1 or 0 occurrences of:
- - - a hyphen
- \d{1,2} - 1 or 2 digits
- : - a colon
- \d{2} - 2 digits
(\s*[pa]m) - Group 2 capturing a sequence of:
- \s* - 0+ whitespaces
- [pa] - p or a (or P or A)
- m - m or M
\b - a trailing word boundary.

See Python demo:

import re
rx = r"(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b"
s = "Join us for a guided tour of the Campus given by the \nAdmissions staff. The tour will take place from 3:15-4:00 PM EST or from 7:30 AM EST  \nand leaves from the Admissions Office in x House. No registration required.' "
matches = ["{}{}".format(x.group(1),x.group(2)) for x in re.finditer(rx, s)]
print(matches)

Since the results are in two separate groups, we need to iterate over all the matches and concatenate the two group values.
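As a possible alternative, re.findall returns one tuple of group values per match when the pattern has more than one capturing group, so the two groups can be joined directly. This is a sketch on a shortened sample string, not the demo input above.

```python
import re

# Same pattern as above, compiled once for reuse
rx = re.compile(r"(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b")

s = "The tour will take place from 3:15-4:00 PM EST or from 7:30 AM EST."

# With two capturing groups, findall yields tuples like ('3:15', ' PM')
matches = ["".join(groups) for groups in rx.findall(s)]
print(matches)
```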

Upvotes: 3

Safirah

Reputation: 375

You could use this regular expression:

r"from [^A-Za-z]+(?:AM|PM)"

This looks in the text for a spot that starts with "from" and is followed by no letters (except AM or PM). On the text you provided, it returns

from 3:15-4:00 PM

You could use it the following way:

import re
print(re.search(r"from [^A-Za-z]+(?:AM|PM)", text).group())
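Note that re.search returns None when nothing matches, so calling .group() on the result would raise an AttributeError for text without a time. A minimal sketch that guards against this follows; the helper name extract_from_time is made up for illustration.

```python
import re

def extract_from_time(text):
    """Return the 'from <time> AM/PM' fragment, or None if the text has none."""
    match = re.search(r"from [^A-Za-z]+(?:AM|PM)", text)
    # re.search gives None on failure, so check before calling .group()
    return match.group() if match else None


print(extract_from_time('The tour will take place from 3:15-4:00 PM EST.'))
print(extract_from_time('The tour leaves from the Admissions Office.'))
```

This pattern is still tied to the word "from" and to upper-case AM/PM, which the question flags as a possible weakness.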

Upvotes: 0
