Reputation: 31
I am scraping data from the WSJ Biggest Gainers website. I am new to Python, so I'm sure this is simple, but I can't find a clear answer.
My code currently only downloads the data from one page, but I want it to go back through previous days of data and use find_all or select to pull the data from the charts. How can I modify the URL in the code to do this? I am using Python 3.4.3 and bs4.
The nice thing is that the URLs for previous days differ only in the date digits.
For example, this is last Friday:
http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150731.html?mod=mdc_pastcalendar
This is last Thursday:
http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150730.html?mod=mdc_pastcalendar
Ideally I would like to be able to change the month, day, or year, and then loop over the resulting page URLs to retrieve the data.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'http://online.wsj.com/mdc/public/page/2_3021-gainnyse-gainer.html'
r = requests.get(url)  # downloads website html
soup = BeautifulSoup(r.content)  # soup calls the data
v_data = soup.select('.text')
for symbol in v_data:
    print(symbol.text)
I just want to loop this function for the past X days. I have tried making a list of URLs to run with no luck. It is also more work to make a list of URLs, so if I could use something like %s or %d for month, year, and date, then that would be better.
Upvotes: 3
Views: 2047
Reputation: 180411
You can begin from a start date and subtract one day at a time using timedelta, passing each date into the URL with str.format and strftime:
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
start = date.today()
for _ in range(5):
    # format the current date as YYYYMMDD and insert it into the URL
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)  # step back one day for the next iteration
    r = requests.get(url)  # downloads website html
    soup = BeautifulSoup(r.content)  # soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text)
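To check the URL construction on its own before hitting the site, the date formatting can be separated from the download. This is a minimal sketch, not part of the original answer: build_urls and URL_TEMPLATE are hypothetical names, and the template is the same URL pattern quoted in the question.

```python
from datetime import date, timedelta

# Same URL pattern as in the question; {} is replaced by a YYYYMMDD date.
URL_TEMPLATE = ("http://online.wsj.com/mdc/public/page/"
                "2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar")


def build_urls(start, days):
    """Return one URL per calendar day, walking backwards from start."""
    urls = []
    for offset in range(days):
        day = start - timedelta(days=offset)
        urls.append(URL_TEMPLATE.format(day.strftime("%Y%m%d")))
    return urls


urls = build_urls(date(2015, 7, 31), 2)
for url in urls:
    print(url)
```

The two printed URLs should match the Friday and Thursday links from the question, which makes it easy to confirm the format string before adding requests and BeautifulSoup back in.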
Just create whatever date you want. If you want a particular start date, just create a datetime object:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
start = datetime(2015, 7, 31)  # no leading zero: 07 is a syntax error in Python 3
for _ in range(5):
    print("Data for {}".format(start.strftime("%b %d %Y")))
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)  # step back one day for the next iteration
    r = requests.get(url)  # downloads website html
    soup = BeautifulSoup(r.content)  # soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")  # separator between days
Output:
Data for Jul 31 2015
|
WHAT'S THIS?
|
1
MoneyGram International (MGI)
2
YRC Worldwide (YRCW)
3
Immersion (IMMR)
4
Skywest (SKYW)
5
Vital Therapies (VTL)
6
..........................
Data for Jul 30 2015
|
WHAT'S THIS?
|
1
H&E Equipment Services (HEES)
2
Senomyx (SNMX)
3
eHealth (EHTH)
4
Nutrisystem (NTRI)
5
Open Text (OTEX)
6
LivePerson (LPSN)
7
Sonus Networks (SONS)
8
FormFactor (FORM)
9
Pegasystems (PEGA)
10
Town Sports International Holdings (CLUB)
11
FARO Technologies (FARO)
12
Presbia (LENS)
13
If you only want to include weekdays and still get n days, then we need to add a little more logic:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
start = datetime(2015, 7, 31)
def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}  # Monday through Friday
    while i != n:
        # skip back over Saturday and Sunday
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)
for dte in only_weekdays_range(start, 2):
    # use the yielded date (dte), not the fixed start date
    print("Data for {}".format(dte.strftime("%b %d %Y")))
    url = start_url.format(dte.strftime("%Y%m%d"))
    print(url)
    r = requests.get(url)  # downloads website html
    soup = BeautifulSoup(r.content)  # soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")  # separator between days
The only_weekdays_range generator yields n days counting back from our start date, excluding weekends. You can check this with print(list(only_weekdays_range(datetime(2015, 7, 26), 2))). We get [datetime.datetime(2015, 7, 24, 0, 0), datetime.datetime(2015, 7, 23, 0, 0)], which is Friday the 24th and Thursday the 23rd, because our start day is Sunday the 26th.
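The weekday check can be run without any network access. The sketch below restates the generator with a single loop so that it runs on its own; it is an equivalent variant written for illustration, not the exact code from the answer.

```python
from datetime import datetime, timedelta


def only_weekdays_range(start, n):
    """Yield n weekdays, walking backwards from start (start itself included)."""
    found = 0
    while found < n:
        if start.weekday() < 5:  # Monday is 0, Friday is 4
            yield start
            found += 1
        start -= timedelta(days=1)


# Start on Sunday 26 July 2015 and ask for two weekdays.
days = list(only_weekdays_range(datetime(2015, 7, 26), 2))
print([d.strftime("%Y-%m-%d") for d in days])
```

Starting from a weekday instead returns that day first, so the start date is always included when it is not a Saturday or Sunday.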
If you also want to exclude holidays, then that is quite a bit more work. Another approach would be to decrement n only when you get data returned from v_data, but that could lead to infinite loops for various reasons, for example if the site stops returning data for every date.
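One way to count only days that return data while avoiding the infinite loop is to cap how far back the search may go. This is a minimal sketch under stated assumptions: collect_days_with_data, fetch, and max_lookback are hypothetical names, and fake_fetch is a stand-in for the real requests and BeautifulSoup download so that the example runs offline.

```python
from datetime import date, timedelta


def collect_days_with_data(start, n, fetch, max_lookback=30):
    """Walk backwards from start, keeping days for which fetch returns rows.

    fetch is a hypothetical callable: date -> list of rows (empty if no data).
    max_lookback caps the number of days examined, so the loop always ends.
    """
    results = []
    day = start
    for _ in range(max_lookback):
        if len(results) == n:
            break
        rows = fetch(day)
        if rows:  # only count days that actually returned data
            results.append((day, rows))
        day -= timedelta(days=1)
    return results


# Stand-in for the real download: pretend weekends and 3 July 2015 have no data.
def fake_fetch(day):
    if day.weekday() >= 5 or day == date(2015, 7, 3):
        return []
    return ["row for {}".format(day.isoformat())]


found = collect_days_with_data(date(2015, 7, 6), 3, fake_fetch)
print([d.isoformat() for d, _ in found])
```

If fewer than n days with data turn up within max_lookback days, the function returns what it found rather than looping forever; in real use, fetch would wrap the download and the soup.select('.text') call.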
Upvotes: 5