Reputation: 41
I'm trying to scrape the weather condition (index 9 in list v) and save it in a variable for later. I'm having difficulty writing a regex that captures a condition that is either one or two words long.
I tested my regex on regexr.com and it looks fine, but it doesn't work when run in IDLE.
v = ['\n\n7:53 AM\n\n\n\n\n',
'\n\n\n\n\n\n48 \nF\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n45 \nF\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n89 \n%\n \n\n\n\n\n\n\n',
'\n\nSE\n\n\n\n\n',
'\n\n\n\n\n\n5 \nmph\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n0 \nmph\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n30.11 \nin\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n0.0 \nin\n \n\n\n\n\n\n\n',
'\n\nMostly Cloudy\n\n\n\n\n']
for condition in str(v[9]):
    condition_search = re.findall('[A-Z]\w+', condition)
    if len(condition_search) > 1:
        condition = ' '
        condition = condition.join(condition_search)
    else:
        condition = str(condition_search)
print(condition)
Actual results:
'[]'
Desired results:
'Mostly Cloudy'
Upvotes: 3
Views: 113
Reputation: 580
Since you are scraping weather data, I am assuming the data you get is standardized in some way.
Looking at the data, you can see that the information you need is padded with lots of newline and space characters at the front and back, which you don't need. To remove them:
Simpler Non-regex solution:
# This removes the leading and trailing whitespace characters from each item,
# including spaces, newlines, tabs, etc.
processed_weather_data = [line.strip() for line in v]
# Let's say you need the weather condition, which is at index 9
print(processed_weather_data[9])
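One thing to keep in mind: strip() only trims the ends, so an entry such as the temperature would still contain an inner newline ('48 \nF'). If you need those fields as well, a minimal sketch is to split on whitespace and re-join with single spaces. The list here is a shortened copy of v from the question, and normalize is just a hypothetical helper name:

```python
# A shortened copy of the list from the question, for illustration only.
v = ['\n\n7:53 AM\n\n\n\n\n',
     '\n\n\n\n\n\n48 \nF\n \n\n\n\n\n\n\n',
     '\n\nMostly Cloudy\n\n\n\n\n']

def normalize(text):
    # str.split() with no arguments splits on runs of any whitespace
    # and drops empty strings, so re-joining collapses all the padding.
    return ' '.join(text.split())

cleaned = [normalize(item) for item in v]
print(cleaned)  # ['7:53 AM', '48 F', 'Mostly Cloudy']
```

For the condition at index 9 this gives the same result as strip(), since that entry has no newlines between the two words.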
Upvotes: 2
Reputation: 13195
Regexps are nice, but I think you are looking for .strip():
text='\n\nMostly Cloudy\n\n\n\n\n'
print(text.strip())
Result:
Mostly Cloudy
and the surrounding whitespace is gone.
(Find docs on https://docs.python.org/3/library/stdtypes.html)
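As for why the code in the question prints '[]': iterating over a string yields one character at a time, and the pattern needs a capital letter followed by at least one more word character, so it cannot match a single character. The last character is a newline, so the final pass through the loop leaves condition set to str([]). A small sketch of this, assuming the same text as above, with a version that calls findall once on the whole string:

```python
import re

text = '\n\nMostly Cloudy\n\n\n\n\n'

# Iterating over a string yields single characters, not words.
first_chars = list(text)[:4]
print(first_chars)            # ['\n', '\n', 'M', 'o']

# The pattern needs a capital letter plus at least one more word
# character, so it can never match a one-character string.
per_char = [re.findall(r'[A-Z]\w+', ch) for ch in text]
print(any(per_char))          # False

# Calling findall once on the whole string finds both words.
words = re.findall(r'[A-Z]\w+', text)
condition = ' '.join(words)
print(condition)              # Mostly Cloudy
```

The join also works when there is only one word, so the if/else on the number of matches is not needed.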
Upvotes: 3
Reputation: 27723
Maybe, this would simply return that:
import re
v = ['\n\n7:53 AM\n\n\n\n\n',
'\n\n\n\n\n\n48 \nF\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n45 \nF\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n89 \n%\n \n\n\n\n\n\n\n',
'\n\nSE\n\n\n\n\n',
'\n\n\n\n\n\n5 \nmph\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n0 \nmph\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n30.11 \nin\n \n\n\n\n\n\n\n',
'\n\n\n\n\n\n0.0 \nin\n \n\n\n\n\n\n\n',
'\n\nMostly Cloudy\n\n\n\n\n']
condition_search = re.findall(r'[A-Z][A-Za-z\s]+[a-z]', v[9])
print(condition_search[0])
Result:
Mostly Cloudy
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also check how it matches against some sample inputs.
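If you would rather read the expression in code, here is a rough sketch of the same pattern written with re.VERBOSE so each part can carry a comment. The sample inputs other than 'Mostly Cloudy' are made up for illustration, padded the same way as the scraped data:

```python
import re

# Same expression as above, split over several lines with re.VERBOSE.
pattern = re.compile(r'''
    [A-Z]          # the condition starts with a capital letter
    [A-Za-z\s]+    # then letters or whitespace, as many as possible
    [a-z]          # and it must end on a lowercase letter
''', re.VERBOSE)

# Hypothetical sample inputs, padded like the scraped values.
samples = ['\n\nMostly Cloudy\n\n\n\n\n',
           '\n\nFair\n\n\n\n\n',
           '\n\nSE\n\n\n\n\n']

results = [pattern.findall(s) for s in samples]
print(results)  # [['Mostly Cloudy'], ['Fair'], []]
```

The trailing [a-z] is what stops the match from swallowing the newlines after the last word. It also means an all-caps value such as 'SE' does not match, so this expression only suits fields like the condition.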
Upvotes: 2