klex52s

Reputation: 437

Match words only if preceded by specific pattern

I have a string from an NWS bulletin:

LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks 
KHNX 141001 RECHNX Weather Service San Joaquin Valley

My aim is to extract a couple of fields with regular expressions. From the first string I want "AAD" and from the second string I want "RECHNX". I have tried:

( )\w{3} #for the first string

and

\w{6} #for the 2nd string

But these match every 3- and 6-character string leading up to the one I want.

Upvotes: 0

Views: 333

Answers (3)

glhr

Reputation: 4547

Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:

(?<=\d{6}\s)[A-Z]+

Demo: https://regex101.com/r/dsDHTs/1
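
As a rough sketch of how the lookbehind pattern above could be used from Python, assuming each bulletin line is handled separately (the names bulletin_lines, field_pattern and fields are only illustrative):

```python
import re

# The two sample lines from the question.
bulletin_lines = [
    "LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks",
    "KHNX 141001 RECHNX Weather Service San Joaquin Valley",
]

# The match must be immediately preceded by six digits and one whitespace
# character; the lookbehind itself is not part of the matched text.
field_pattern = re.compile(r"(?<=\d{6}\s)[A-Z]+")

fields = []
for line in bulletin_lines:
    match = field_pattern.search(line)
    # search() returns None when nothing matches, so guard before .group().
    fields.append(match.group(0) if match else None)

print(fields)  # ['AAD', 'RECHNX']
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because \d{6}\s always spans exactly seven characters.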

Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:

(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*

Demo: https://regex101.com/r/dsDHTs/5

If you have a specific list of valid fields, you could also simply use:

(AAD|TMLB|RECHNX|RR4HNX)

Demo: https://regex101.com/r/dsDHTs/3
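
If the list of valid fields lives in code, the alternation can be built from it instead of typed by hand. This is only a sketch: valid_fields holds the four codes from the pattern above, and the \b word boundaries are my addition so that a code is not matched inside a longer word.

```python
import re

# The field codes from the alternation above; extend as needed.
valid_fields = ["AAD", "TMLB", "RECHNX", "RR4HNX"]

# re.escape protects against any code that contains regex metacharacters.
alternation = "|".join(re.escape(field) for field in valid_fields)
known_field_pattern = re.compile(rf"\b(?:{alternation})\b")

line = "LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks"
found = known_field_pattern.findall(line)

print(found)  # ['AAD', 'TMLB']
```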

Upvotes: 1

Valdi_Bo

Reputation: 31011

To read the first few groups of word characters from each line, you can use a pattern like (\w+) (\w+) (\w+) (\w+).

Then take group 4 from the first line and group 3 from the second line.

Look at the following program. It prints four groups from each source line:

import re

txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""

n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [ mtch.group(i) for i in range(1, 5) ]
        print(f'    {gr}')

The result is:

 1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks 
    ['LTUS41', 'KCAR', '141558', 'AAD']
 2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
    ['KHNX', '141001', 'RECHNX', 'Weather']

Upvotes: 0

blhsing

Reputation: 107124

Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):

re.search(r'\b\d+ (\w+)', s).group(1)
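
Note that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on a line without a number. A slightly more defensive sketch, where field_after_number is just an illustrative helper name:

```python
import re

def field_after_number(s):
    """Return the word that follows the first standalone number in s, or None."""
    # \b keeps the digits in 'LTUS41' from counting as a number on their own.
    match = re.search(r'\b\d+ (\w+)', s)
    return match.group(1) if match else None

print(field_after_number("LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks"))  # AAD
print(field_after_number("KHNX 141001 RECHNX Weather Service San Joaquin Valley"))  # RECHNX
```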

Upvotes: 0
