Amocat _

Reputation: 41

Regex in Python Loop

I'm trying to scrape the weather condition (index 9 in the list v) and save it in a variable for later use. I'm having difficulty writing a regex that captures a condition that is either one or two words.

I tested my regex on regexr.com and it looks fine, but it doesn't work when run in IDLE.

v = ['\n\n7:53 AM\n\n\n\n\n',
 '\n\n\n\n\n\n48 \nF\n    \n\n\n\n\n\n\n',
 '\n\n\n\n\n\n45 \nF\n    \n\n\n\n\n\n\n',
 '\n\n\n\n\n\n89 \n%\n    \n\n\n\n\n\n\n',
 '\n\nSE\n\n\n\n\n',
 '\n\n\n\n\n\n5 \nmph\n    \n\n\n\n\n\n\n',
 '\n\n\n\n\n\n0 \nmph\n    \n\n\n\n\n\n\n',
 '\n\n\n\n\n\n30.11 \nin\n    \n\n\n\n\n\n\n',
 '\n\n\n\n\n\n0.0 \nin\n    \n\n\n\n\n\n\n',
 '\n\nMostly Cloudy\n\n\n\n\n']

for condition in str(v[9]):
        condition_search = re.findall('[A-Z]\w+', condition)
        if len(condition_search) > 1:
            condition = ' '
            condition = condition.join(condition_search)
        else:
            condition = str(condition_search)

print(condition)

actual results:

'[]'

desired results

'Mostly Cloudy'

Upvotes: 3

Views: 113

Answers (3)

the23Effect

Reputation: 580

Since you are scraping weather data, I'm assuming the data you get is standardized in some way.

Looking at the data, you can tell that the information you need is padded with many newline and space characters at the front and back (which you don't need). To remove them:

A simpler, non-regex solution:

# This removes the leading and trailing whitespace characters from each item,
# including spaces, newlines, tabs, etc.
processed_weather_data = [line.strip() for line in v]

# Let's say you need the weather condition, which is at index 9
print(processed_weather_data[9])
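
One caveat worth noting: strip() only trims the ends, so items such as the temperature keep their inner newline ('48 \nF'). A minimal sketch of one way to collapse inner whitespace as well, assuming a shortened copy of the list from the question (the names sample and cleaned are only illustrative):

```python
# Shortened sample of the scraped list from the question.
sample = ['\n\n7:53 AM\n\n\n\n\n',
          '\n\n\n\n\n\n48 \nF\n    \n\n\n\n\n\n\n',
          '\n\nMostly Cloudy\n\n\n\n\n']

# str.split() with no arguments splits on any run of whitespace and drops
# empty pieces, so joining with a single space also collapses inner newlines.
cleaned = [' '.join(item.split()) for item in sample]

print(cleaned)  # ['7:53 AM', '48 F', 'Mostly Cloudy']
```

For the weather condition alone, strip() is enough, because that item has no whitespace between the words other than a single space.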

Upvotes: 2

tevemadar

Reputation: 13195

Regexps are nice, but I think you are looking for .strip():

text='\n\nMostly Cloudy\n\n\n\n\n'
print(text.strip())

Result:

Mostly Cloudy

and the surrounding whitespace is gone.
(See the str.strip() docs at https://docs.python.org/3/library/stdtypes.html)
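
As for why the loop in the question ends up with '[]': iterating over a string yields one character at a time, and '[A-Z]\w+' needs at least two characters, so every findall call returns an empty list. A small sketch of that behaviour, and of the same pattern applied to the whole string instead (the variable names here are illustrative, not from the question):

```python
import re

raw = '\n\nMostly Cloudy\n\n\n\n\n'

# Looping over a string gives single characters; the pattern needs an
# uppercase letter followed by at least one word character, so no single
# character can ever match.
per_char = [re.findall(r'[A-Z]\w+', ch) for ch in raw]
print(any(per_char))  # False: every result is an empty list

# Applying the pattern once to the whole string finds each word.
words = re.findall(r'[A-Z]\w+', raw)

# ' '.join works for one word or several, so no length check is needed.
condition = ' '.join(words)
print(condition)  # Mostly Cloudy
```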

Upvotes: 3

Emma

Reputation: 27723

Maybe this would simply return it:

import re
v = ['\n\n7:53 AM\n\n\n\n\n',
     '\n\n\n\n\n\n48 \nF\n    \n\n\n\n\n\n\n',
     '\n\n\n\n\n\n45 \nF\n    \n\n\n\n\n\n\n',
     '\n\n\n\n\n\n89 \n%\n    \n\n\n\n\n\n\n',
     '\n\nSE\n\n\n\n\n',
     '\n\n\n\n\n\n5 \nmph\n    \n\n\n\n\n\n\n',
     '\n\n\n\n\n\n0 \nmph\n    \n\n\n\n\n\n\n',
     '\n\n\n\n\n\n30.11 \nin\n    \n\n\n\n\n\n\n',
     '\n\n\n\n\n\n0.0 \nin\n    \n\n\n\n\n\n\n',
     '\n\nMostly Cloudy\n\n\n\n\n']

condition_search = re.findall(r'[A-Z][A-Za-z\s]+[a-z]', v[9])

print(condition_search[0])

Output

Mostly Cloudy

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.
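
Note that condition_search[0] raises IndexError when nothing matches, which can happen here because the expression requires the match to end in a lowercase letter. A hedged sketch of a guard around the same expression, where extract_condition is a hypothetical helper name and the 'Clear' input is an assumed example of a one-word condition:

```python
import re

# Same expression as above, compiled once for reuse.
PATTERN = re.compile(r'[A-Z][A-Za-z\s]+[a-z]')

def extract_condition(raw):
    """Return the first match of PATTERN in raw, or None if there is none."""
    match = PATTERN.search(raw)
    return match.group(0) if match else None

print(extract_condition('\n\nMostly Cloudy\n\n\n\n\n'))  # Mostly Cloudy
print(extract_condition('\n\nClear\n\n\n\n\n'))          # Clear
print(extract_condition('\n\nSE\n\n\n\n\n'))             # None
```

The last call returns None because 'SE' has no lowercase letter for the expression to end on.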


RegEx Circuit

jex.im visualizes regular expressions.

Upvotes: 2
