Reputation: 31
There is music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91 part series (Written over a length of time, uploaded part by part) that always follows the convention of: http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.
I would like to be able to get just the formatted text from every part and put it into one html file.
Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into file. Not hard.
Unfortunately, the url for a print version is as follows: www.ultimate-guitar.com/print.php?what=article&id=95932
The only way to know what article corresponds to what ID field is to look at the value attribute of a certain input tag in the original article.
What I want to do is this:
Go to each page, incrementng through the varying numbers.
Find the <input> tag with attribute 'name="rowid"' and get the number in it's 'value=' attribute.
Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html><head> and <body> to a html file.
Rinse and repeat.
Is this possible? And is python the right language? Also, what dom/html/xml library should I use?
Thanks for any help.
Upvotes: 2
Views: 5421
Reputation: 2010
With lxml and urllib2:
import lxml.html
import urllib2
#implement the logic to download each page, with HTML strings in a sequence named pages
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"
for page in pages:
html = lxml.html.fromstring(page)
ID = html.find(".//input[@name='rowid']").value
article = urllib2.urlopen(url % ID).read()
article_html = lxml.html.fromstring(article)
with open(ID + ".html", "w") as html_file:
html_file.write(article_html.find(".//body").text_content())
edit: Upon running this, it seems there may be some Unicode characters in the page. One way to get around this is to do article = article.encode("ascii", "ignore")
or to put the encode method after .read(), to force ASCII and ignore Unicode, though this is a lazy fix.
This is assuming you just want the text content of everything inside the body tag. This will save files with the format of storyID.html (so "95932.html") in the local directory of the Python file. Change the save semantics if you like.
Upvotes: 1
Reputation: 20270
You could actually do this in javascript/jquery without too much trouble. javascripty-pseudocode, appending to an empty document:
for(var pageNum = 1; i<= 91; i++) {
$.ajax({
url: url + pageNum,
async: false,
success: function() {
var printId = $('input[name="rowid"]').val();
$.ajax({
url: printUrl + printId,
async: false,
success: function(data) {
$('body').append($(data).find('body').contents());
}
});
}
});
}
After the loading completes you could save the resultant HTML to a file.
Upvotes: 0