Reputation: 302
First, sorry for the lengthy title. My setup: Windows 7 64-bit, running Python 3.4.3 64-bit in PyCharm Educational Edition 1.0.1.
Now, onto the problem. I have a list that contains data pulled from a website. The list contains strings, some being just dates, some being just words, and some being dates with words. It looks like this:
tempDates = ['Date', 'Visitor', 'Home', 'Notes', '2013-10-01', 'Washington Capitals', 'Chicago Blackhawks', None, '2013-10-01', 'Winnipeg Jets',..., 'St. Louis Blues',..., 'Postponed due to blizzard until 2014-02-25', etc]
What I am trying to do is remove everything but the stand-alone dates. Using a generator, a while loop, and an if statement, I was able to remove everything but the strings that contain both dates and words. That portion of code looks like this:
dates = []
d = 0
while d < len(tempDates):
    if tempDates[d] is None or all(i.isalpha() or i == ' ' or i == ',' or i == '-' or i == '.' for i in tempDates[d]):
        d += 1
    else:
        dates.append(tempDates[d])
        d += 1
The output of this code is this:
dates = ['2013-10-01', '2013-10-01',..., '2014-01-21', '2014-01-21', '2014-01-21', 'Postponed due to snowstorm until 2014-01-22', '2014-01-22', 'Make-up game for snowstorm 2014-01-21',..., '2014-06-13']
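A reduced example that reproduces this behaviour may make the cause easier to see. The `sample` list and the `is_text_only` helper are made up for illustration; the test inside the helper is the same one the while loop uses. Any string that contains a digit fails the `all(...)` test, so it is kept whether or not it also contains words.

```python
# Hypothetical sample; the real list comes from the scraped page.
sample = ['Date', 'Washington Capitals', None, '2013-10-01',
          'Postponed due to blizzard until 2014-02-25']


def is_text_only(s):
    # Same test as in the while loop: True when every character is a
    # letter or one of the allowed punctuation characters.
    return all(c.isalpha() or c in ' ,-.' for c in s)


# Keep everything that is not None and not text-only.
kept = [s for s in sample if s is not None and not is_text_only(s)]
print(kept)
# ['2013-10-01', 'Postponed due to blizzard until 2014-02-25']
```

The note string survives because of the digits in its embedded date, which is why this test alone cannot tell a stand-alone date from a note that mentions one.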
I can't find a way to remove the strings that contain both words and dates without also removing the stand-alone dates. I have tried changing the order in which the program filters dates out of tempDates, but that only led to infinite loops and memory problems. If it helps, here is the full program:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# create empty lists to hold the data pulled from the website
dateList = []
gameList = []
winnerList = []
loserList = []

year = 2014  # program is made to iterate through all available seasons since 1918, but is set to start at 2014 for quicker troubleshooting
while year < 2016:  # stops year at 2015, as this is the last year stats are available
    if year == 2005:  # prevents an error from the program trying to load data from 2005, as that season was canceled
        year += 1
    else:
        # pulls the whole page and puts it into r
        r = requests.get('http://www.hockey-reference.com/leagues/NHL_{}_games.html'.format(year))
        data = r.text

        soup = BeautifulSoup(data, "lxml")
        foundTeams = soup.find_all(href=re.compile("teams"))
        teams = [link.string for link in foundTeams]
        teams = teams[2:]

        foundScores = soup.find_all(align=re.compile("right"))
        tempScores = [link.string for link in foundScores]
        tempScores = tempScores[2:]

        foundDates = soup.find_all(align=re.compile("left"))
        tempDates = [link.string for link in foundDates]

        dates = []
        d = 0
        while d < len(tempDates):
            if tempDates[d] is None or all(i.isalpha() or i == ' ' or i == ',' or i == '-' or i == '.' for i in tempDates[d]):
                d += 1
            else:
                dates.append(tempDates[d])
                d += 1

        season = soup.find('h1')
        season = [link.string for link in season]

        games = len(teams) / 2
        games = int(games)

        # goes through the pulled data and saves it into lists to be written to a compiled file
        x = 0
        y = 0
        while x < len(teams):
            if x % 2 == 0:
                if tempScores[x] is None or all(i.isalpha() or i == ' ' or i == ',' for i in tempScores[x]):
                    x += 2
                else:
                    print(dates[y])
                    if tempScores[x] > tempScores[x+1]:
                        print("In game", y + 1)
                        print("The", teams[x], tempScores[x], "won against the", teams[x+1], tempScores[x+1])
                        winnerList.append(teams[x])
                        loserList.append(teams[x+1])
                    elif tempScores[x] < tempScores[x+1]:
                        print("In game", y + 1)
                        print("The", teams[x+1], tempScores[x+1], "won against the", teams[x], tempScores[x])
                        winnerList.append(teams[x+1])
                        loserList.append(teams[x])
                    dateList.append(dates[y])
                    gameList.append(y)
                    x += 1
                    y += 1
            else:
                x += 1
        year += 1

# converts the compiled lists to data frames
dateList = pd.DataFrame(dateList)
gameList = pd.DataFrame(gameList)
winnerList = pd.DataFrame(winnerList)
loserList = pd.DataFrame(loserList)

# puts the data frames into one data frame
compiledStats = dateList
compiledStats['Game'] = gameList
compiledStats['Game Winner'] = winnerList
compiledStats['Game Loser'] = loserList

# rename the columns
compiledStats.columns = ['Date', 'Game', 'Game Winner', 'Game Loser']
# write to a new file
compiledStats.to_csv('CSV/Compiled_NHL_Stats2.0.csv', index=False)
Upvotes: 2
Views: 55
Reputation: 67968
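import re

tempDates = ['Date', 'Visitor', 'Home', 'Notes', '2013-10-01', 'Washington Capitals', 'Chicago Blackhawks', None, '2013-10-01', 'Winnipeg Jets','...', 'St. Louis Blues','...', 'Postponed due to blizzard until 2014-02-25', 'etc']
print([i for i in tempDates if re.match(r"\d{4}-\d{2}-\d{2}", str(i))])
This should do it for you. re.match only matches at the start of the string, so entries that begin with words (such as 'Postponed due to blizzard until 2014-02-25') are skipped, and str(i) keeps the None entries from raising an error. Since you are on Python 3, print needs parentheses.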
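If a string could ever start with a date and continue with words, a stricter variant may be safer. The following is only a sketch under that assumption: `is_standalone_date`, `DATE_RE`, and the sample list are hypothetical names and data, not part of the original program. It uses `re.fullmatch` (available from Python 3.4) so that the whole string has to be a date, and `datetime.strptime` to reject impossible dates.

```python
import re
from datetime import datetime

# Four digits, dash, two digits, dash, two digits, and nothing else.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_standalone_date(value):
    """Return True only if value is exactly a valid YYYY-MM-DD date."""
    # Reject None and anything that is not entirely a date-shaped string.
    if not isinstance(value, str) or DATE_RE.fullmatch(value) is None:
        return False
    try:
        # Reject date-shaped strings that are not real dates.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Made-up sample that includes the awkward cases.
temp_dates = ['Date', 'Visitor', None, '2013-10-01', 'Winnipeg Jets',
              'Postponed due to blizzard until 2014-02-25',
              '2014-01-22 (rescheduled)', '2014-13-45']

dates = [d for d in temp_dates if is_standalone_date(d)]
print(dates)  # ['2013-10-01']
```

The strings that BeautifulSoup returns from `.string` are subclasses of `str`, so the `isinstance` check should accept them unchanged.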
Upvotes: 3