oorahduc

Reputation: 185

Python Regex: Multiple conditional matches in same string

I am currently polling this function to check multiple URLs. It reads an HTML page into a string and matches the progress percentage of a file transfer like this:

def check(server):
    logging.info('Fetching {0}.'.format(server))
    # Open page
    response = urllib2.urlopen("http://"+server+"/avicapture.html")
    tall = response.read() # puts the data into a string
    html = tall.rstrip()
    # Grab progress percentage.
    match = re.search(r'.*In Progress \((.*)%\).*', html)

Then, on a match, it returns the percentage as a string to the parent process.

    if match:
        global temp
        global results
        temp = match.group(1)
        results = temp
        servers[server] = temp
        if 98 <= int(temp) <= 99:
            abort(server)
            alertmail(temp, server)
            rem = str(server)
            complete(rem)
            logging.info('{0} completed.'.format(server))
        return str(temp)

Sometimes, however, the page does not say "In Progress" with a percentage. Instead it says "Transfer Aborted" or "Ready". How would I structure this so it returns whichever it finds: In Progress (percentage), Transfer Aborted, or Ready?

Edit: I forgot to mention that I need it to match the most recent file transfer, based off End Time. (See: http://www.whatdoiknow.net/dump/avicapture_full.html#status )

Partial solution:

    match = re.search(r'.*In Progress \((.*)%\).*', html)
    match2 = re.search('.*Ready.*', html)
    match3 = re.search('.*Transfer Aborted.*', html)
    if match:
        global temp
        temp = match.group(1)
        if 98 <= int(temp) <= 99:
            logging.info('{0} completed.'.format(server))
        return str(temp)
    elif match2:
        temp = "Ready"
        logging.info('{0} is ready.'.format(server))
        return str(temp)
    elif match3:
        temp = "Transfer Aborted"
        logging.info('{0} was Aborted.'.format(server))
        return str(temp)

However, this does not address my need to identify the most recent transfer.

Upvotes: 0

Views: 1854

Answers (1)

pavel_form

Reputation: 1790

You just need to use | (alternation) in the regex:

match = re.search(r"(In Progress \((.*)%\)|Transfer Aborted|Ready)", html)

With this, match.group(1) will contain the whole match (either In Progress (00%), Transfer Aborted, or Ready), while match.group(2) will hold the number 00 (a placeholder here) in the first case, or None in the second and third cases.
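As a rough sketch of how that could replace the three separate searches from the question: the helper name parse_status is made up for illustration, and I have swapped (.*) for (\d+) so that the captured group can only be digits, which is an assumption about what the page prints.

```python
import re

# Alternation: group 1 is the whole status text, group 2 is the
# percentage and is only set for the "In Progress" branch.
STATUS_RE = re.compile(r"(In Progress \((\d+)%\)|Transfer Aborted|Ready)")

def parse_status(html):
    """Return the percentage as a string, or 'Transfer Aborted' / 'Ready',
    or None if no known status text is found (hypothetical helper)."""
    match = STATUS_RE.search(html)
    if match is None:
        return None
    if match.group(2) is not None:
        # "In Progress (NN%)" matched, so return just the number.
        return match.group(2)
    # One of the literal alternatives matched.
    return match.group(1)
```

Note that re.search returns the first status that appears in the page, not the most recent transfer.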

UPDATE 1: about the need to get the most recent line. The page at http://www.whatdoiknow.net/dump/avicapture.html is rather simple HTML, so my suggestion is to use an HTML parsing tool (I recommend beautifulsoup4, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse it into a tree, then find the first row in the table with N/A, take the row before it, and apply the regex to its last column.

UPDATE 2: now that I think about it, there is probably no need to parse the HTML. You can use re.findall (or re.finditer) to get a list of matched tuples of strings (or match objects) and just take the last item from it.
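A minimal sketch of the "take the last match" idea using re.finditer. The function name last_status is hypothetical, (\d+) is my assumption about the percentage format, and the whole approach only works if the most recent transfer really is the last one in the page.

```python
import re

# Group 1 is the percentage; it is None for the two literal alternatives.
STATUS_RE = re.compile(r"In Progress \((\d+)%\)|Transfer Aborted|Ready")

def last_status(html):
    """Return the status of the last matching entry in the page,
    or None if nothing matches (hypothetical helper)."""
    matches = list(STATUS_RE.finditer(html))
    if not matches:
        return None
    last = matches[-1]
    if last.group(1) is not None:
        return last.group(1)   # percentage as a string
    return last.group(0)       # "Transfer Aborted" or "Ready"
```

I used finditer rather than findall here because findall returns plain strings or tuples depending on how many groups the pattern has, while match objects behave the same either way.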

UPDATE 3: Update 1 and Update 2 assume that the table is sorted by date. If it is not, you'll need to include a date pattern in the regex and pick the match with the latest date.
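Something along these lines could work for the unsorted case. I have not checked the real page's markup, so the date format (YYYY-MM-DD HH:MM:SS) and the assumption that the End Time appears before the status on the same line are both guesses that you would need to adjust. The name most_recent_status is also just for illustration.

```python
import re
from datetime import datetime

# ASSUMPTION: each row has an End Time like "2013-05-03 09:30:00"
# followed, on the same line, by the status text.
ROW_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"            # group 1: End Time
    r".*?"                                              # rest of the row, same line
    r"(In Progress \((\d+)%\)|Transfer Aborted|Ready)"  # groups 2 and 3
)

def most_recent_status(html):
    """Return the status from the row with the latest End Time,
    or None if no row matches (hypothetical helper)."""
    best = None  # (end_time, status) of the latest row seen so far
    for m in ROW_RE.finditer(html):
        end_time = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        status = m.group(3) if m.group(3) is not None else m.group(2)
        if best is None or end_time > best[0]:
            best = (end_time, status)
    return None if best is None else best[1]
```

Rows whose End Time is not a parseable date (N/A, for example) are simply skipped by this pattern, so a transfer that is still running would need separate handling.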

Upvotes: 1
