Nick


python regex help: unknown information to skip

I'm having trouble writing the regular expression I need. I probably need some combination of lookaround or conditional expressions, but I'm at a loss.

I have a data string like:

pattern1 pattern2 pattern3 unwanted-groups pattern4 random number of tokens pattern5 optional1 optional2 more unknown unwanted junk separated with white spaces optional3 optional4 etc

I have a matching expression for each of the 'pattern#' and 'optional#' groups (the optional groups are not required in the data and so are not always present). For the other sections I have neither a pattern (the text is free-form) nor a group count to skip; all I know is that the tokens are separated by whitespace.

I've managed to figure out how to skip the unwanted stuff between the required groups, but when I hit the optional groups I'm lost. Any suggestions on where I should look for hints or help?

Thanks

This is what I currently have:

pattern = re.compile(r'(?:(METAR|SPECI)\s*)*(?P<ICAO>[\w]{4}\s)*'
                r'(?P<NIL>(NIL)\s)*(?P<UTC>[\d]{6}Z\s)*(?P<AUTOCOR>(AUTO|COR)*\s)*'
                r'(?P<WINDS>[\w]{5,6}G*[\d]{0,2}(MPS|KT|KMH)\s)\s*'
                r'.*?\s' #skip miscellaneous between winds and thermal data
                r'(?P<THERM>[\d]{2}/[\d]{2}\s)\s*(?P<PRESS>A[\d]{4}\s)\s*'
                r'(?:RMK\s)\s*(?P<AUTO>AO\d\s)*'
                r'(?P<PEAK>(PK\sWND\s[\d]{5,6}/[\d]{2,4}))*'
                r'(?P<SLP>SLP[\d]{3}\s)*'
                r'(?P<PRECIP>P[\d]{4}\s)*'          
                r'(?P<remains>.*)'
                )

example = "METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 10389 20272 53007="

data = pattern.match(example)

It seems to work for the first 10 groups, but that is about it.

Again, thanks everybody.

Upvotes: 0

Views: 200

Answers (3)

Jim Dennis

Reputation: 17510

If all of your targets consist of things like "foo1", "bar22" etc (in other words a sequence of letters followed by a sequence of digits) and everything else (sequences of digits, "words" without numeric suffixes, etc) is "junk" then the following seems to be sufficient:

re.findall(r'[A-Za-z]+\d+', targetstr)

(We can't use just r'\w+\d+' because \w matches digits and _ (underscores) as well as letters).
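
To make the difference concrete, here is a small sketch comparing the two patterns. The sample string is made up for illustration and is not taken from the question:

```python
import re

# Hypothetical sample input: two wanted tokens plus assorted junk.
sample = "foo1 bar22 junk 12345 some_99 baz"

# Letters followed by digits: picks up only the wanted tokens.
letters_then_digits = re.findall(r'[A-Za-z]+\d+', sample)

# \w also matches digits and underscores, so junk tokens slip through.
word_then_digits = re.findall(r'\w+\d+', sample)

print(letters_then_digits)  # ['foo1', 'bar22']
print(word_then_digits)     # ['foo1', 'bar22', '12345', 'some_99']
```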

If you're looking for a limited number of key patterns, or if some of the junk might itself look like "foo123", then you'll obviously have to be more specific.

Upvotes: 0

Geo

Reputation: 96937

If all the data is in that format I'd go with split instead. I think it will be faster.


data = "regex1 regex2 regex3 unwanted-regex regex4 random number of tokens regex5 optregex1 optregex2 more unknown unwanted junk separated with white spaces optregex3 optregex4 etc"
parts = data.split()  # each whitespace-separated token becomes a list element
for index, item in enumerate(parts):
    if index == 3:
        continue  # this is unwanted-regex
    print(item)  # do what you want with the information here
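
When the position of the junk is not known in advance, one possible variation is to split first and then test each token against the per-field patterns. The sketch below assumes single-token fields only; the dictionary `TOKEN_PATTERNS` and the helper `classify_tokens` are illustrative names, and the patterns are adapted from the regex in the question:

```python
import re

# Per-token patterns adapted from the question's regex (names are illustrative).
TOKEN_PATTERNS = {
    "THERM": re.compile(r"\d{2}/\d{2}"),
    "PRESS": re.compile(r"A\d{4}"),
    "AUTO": re.compile(r"AO\d"),
    "SLP": re.compile(r"SLP\d{3}"),
}

def classify_tokens(report):
    """Return {name: token} for the first token that fully matches each pattern."""
    found = {}
    for token in report.split():
        for name, pattern in TOKEN_PATTERNS.items():
            # fullmatch rejects tokens that merely contain the pattern.
            if name not in found and pattern.fullmatch(token):
                found[name] = token
                break
    return found

example = ("METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 "
           "PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 "
           "10389 20272 53007=")
fields = classify_tokens(example)
print(fields)
```

Optional fields that are absent are simply missing from the result. Note that this token-by-token approach cannot recognise fields that span several tokens, such as "PK WND 06042/2325".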

Upvotes: 4

Paolo Tedesco

Reputation: 57242

You need to use the | operator and findall:

>>> import re
>>> regex = re.compile(r"(regex\d+|optregex\d+)")
>>> regex.findall(string)
[u'regex1', u'regex2', u'regex3', u'regex4', u'regex5', u'optregex1', u'optregex2', u'optregex3', u'optregex4']
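
Applied to the remarks section of the METAR example from the question, the same idea could look roughly like the sketch below. It assumes named groups joined with |, so each match reports which field it belongs to; the group names and sub-patterns are taken from the question's regex, while the `\b` anchors and the variable names are additions for illustration:

```python
import re

# One alternative per optional field; anything else in the text is skipped.
FIELDS = re.compile(
    r"(?P<PEAK>\bPK\sWND\s\d{5,6}/\d{2,4})"
    r"|(?P<SLP>\bSLP\d{3}\b)"
    r"|(?P<PRECIP>\bP\d{4}\b)"
)

remarks = ("AO2 PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR "
           "SLP135 T02780189 10389 20272 53007=")

# m.lastgroup is the name of the alternative that matched.
found = {m.lastgroup: m.group() for m in FIELDS.finditer(remarks)}
print(found)  # {'PEAK': 'PK WND 06042/2325', 'SLP': 'SLP135'}
```

Because the junk never has to be matched explicitly, an optional field that is missing (PRECIP here) just does not show up in the result.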

A tip: there are several GUI tools that let you experiment with (and actually help you write) regular expressions. For Python, I'm quite fond of Kodos.

Upvotes: 1
