Mho

Reputation: 89

Extract values based on a pattern in a list (Python)

I would like to extract values from a list based on a certain pattern.

**Example:**
ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']

**Expected Output:**
Tickercode-HF;BPO
StockCode-NYSE;NEW YORK
Relevancescore-81;0

**My code**:
Tickercode=[x for x in ticker if re.match(r'[\w\.-]+[\w\.-]+', x)]
Stockcode=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]
Relevancescore=[x for x in ticker if re.match(r'[\w\.-]+(%)+[\w\.-]+', x)]

**My output:**
['HF (NYSE) (81%);BPO (NEW YORK)]']
[]
[]

But I am getting the wrong output. Please help me resolve the issue.

Thanks

Upvotes: 0

Views: 2847

Answers (2)

gzc

Reputation: 8609

First, each item of ticker contains multiple records separated by semicolons, so I recommend normalizing ticker by splitting each item on ';'. Then iterate over the resulting strings and extract the fields using the pattern '(\w+) \(([\w ]+)\)( \(([\d]+)%\))?'.

import re

ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
ticker=[y for x in ticker for y in x.split(';')]

Tickercode=[]
Stockcode=[]
Relevancescore=[]

for s in ticker:
    m = re.search(r'(\w+) \(([\w ]+)\)( \(([\d]+)%\))?', s)
    Tickercode.append(m.group(1))
    Stockcode.append(m.group(2))
    Relevancescore.append(m.group(4))

print(Tickercode)
print(Stockcode)
print(Relevancescore)

Output:

['HF', 'BPO']
['NYSE', 'NEW YORK']
['81', None]

Update:

Use re.search instead of re.match: re.match only matches the pattern at the start of the string. Your input has leading whitespace, which caused the match to fail.

You can add this check inside the loop, right after the re.search call, to print any string that doesn't match.

    if m is None:
        print('%s cannot be matched' % s)
        continue
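
Putting the pieces together, here is a minimal sketch of the loop with the None check in place. The helper name parse_tickers and its default_score parameter are made up for illustration; substituting '0' for a missing score is an assumption taken from the expected output in the question.

```python
import re

# Same pattern as above; the "(NN%)" relevance score is an optional group.
PATTERN = re.compile(r'(\w+) \(([\w ]+)\)( \((\d+)%\))?')

def parse_tickers(ticker, default_score='0'):
    """Hypothetical helper: split each item on ';' and return three parallel lists."""
    codes, stocks, scores = [], [], []
    for item in ticker:
        for s in item.split(';'):
            m = PATTERN.search(s)
            if m is None:
                # Report and skip records that do not fit the pattern.
                print('%s cannot be matched' % s)
                continue
            codes.append(m.group(1))
            stocks.append(m.group(2))
            # group(4) is None when the record has no "(NN%)" part.
            scores.append(m.group(4) or default_score)
    return codes, stocks, scores

codes, stocks, scores = parse_tickers(['HF (NYSE) (81%);BPO (NEW YORK)]'])
print('Tickercode-' + ';'.join(codes))
print('StockCode-' + ';'.join(stocks))
print('Relevancescore-' + ';'.join(scores))
```

For the sample input this prints the three lines in the format the question asks for, with 0 standing in for the missing score of BPO.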

Upvotes: 3

unpythonic

Reputation: 4070

The problem with your code is that you're building up each of your lists from the input. You're telling it, "make a list of the input if the input matches my regular expression". The re.match() only matches against the beginning of a string, so the only regex that matches is the one that matches against the ticker symbol itself.
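
To make the difference concrete, here is a small sketch using one record from the question's input. The variable names s, m and kept are only for illustration.

```python
import re

s = 'HF (NYSE) (81%)'

# re.match only tries the pattern at position 0, so a pattern that
# begins with "(" cannot match a string that begins with "HF".
print(re.match(r'\((\d+)%\)', s))   # None

# re.search scans the whole string for the first place the pattern fits.
m = re.search(r'\((\d+)%\)', s)
print(m.group(0))   # (81%)  the whole matched text
print(m.group(1))   # 81     only the captured group

# The list comprehension from the question keeps the whole input string x
# whenever the test succeeds; it never extracts the matched part.
ticker = ['HF (NYSE) (81%);BPO (NEW YORK)]']
kept = [x for x in ticker if re.match(r'[\w\.-]+[\w\.-]+', x)]
print(kept)         # the original list, unchanged
```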

I've reorganized your code a bit below to show how it can work.

  • Use re.compile() so the regex doesn't have to be created each time
  • Use re.search() so you can find your embedded patterns
  • Use match.group(1) to get the matching part of the query, not the whole of the input.
  • Break up your input so you're only handling one group at a time

    #!/usr/bin/env python
    
    import re
    
    # Example:
    ticker=['HF (NYSE) (81%);BPO (NEW YORK)]']
    
    # **Expected Output:**
    # Tickercode-HF;BPO
    # StockCode-NYSE;NEW YORK
    # Relevancescore-81;0
    
    tickercode=[]
    stockcode=[]
    relevancescore=[]
    
    ticker_re = re.compile(r'^\s*([A-Z]+)')
    stock_re = re.compile(r'\(([\w ]+)\)')
    relevance_re = re.compile(r'\((\d+)%\)')
    
    for tick in ticker:
        for stockinfo in tick.split(";"):
            ticker_match = ticker_re.search(stockinfo)
            stock_match = stock_re.search(stockinfo)
            relevance_match = relevance_re.search(stockinfo)
    
            ticker_code = ticker_match.group(1) if ticker_match else ''
            stock_code = stock_match.group(1) if stock_match else ''
            relevance_score = relevance_match.group(1) if relevance_match else '0'
    
            tickercode.append(ticker_code)
            stockcode.append(stock_code)
            relevancescore.append(relevance_score)
    
    # print() with a single argument works on both Python 2 and Python 3
    print('Tickercode-' + ';'.join(tickercode))
    print('StockCode-' + ';'.join(stockcode))
    print('Relevancescore-' + ';'.join(relevancescore))
    

Upvotes: 0
