flavinsky

Reputation: 319

Julia: website scraping?

I have been trying for days to make progress with this little piece of code for extracting the headlines and links of news items from a journal website.

using HTTP

function website_parser(website_url::AbstractString)
    # Fetch the page and decode the response body into a String
    # (HTTP.get with String(r.body) replaces the deprecated readstring(get(...)))
    r = HTTP.get(website_url)
    body = String(r.body)
    splitted = split(body, "\n")
end

website_parser("https://www.nature.com/news/newsandviews")

The problem is that I could not figure out how to proceed once I got the text from the website. How can I retrieve specific elements (such as the header and link of each news item)?

Any help is very much appreciated, thank you

Upvotes: 6

Views: 2159

Answers (1)

phipsgabler

Reputation: 20950

You need some kind of HTML parsing. For extracting only the headers, you can probably get away with regular expressions, which are built into Julia.
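As a minimal sketch of the regex route, using only Base Julia (the HTML snippet here is made up for illustration):

```julia
# Extract the contents of a <title> tag with a built-in regex.
# The `s` flag lets `.` match across newlines.
html = "<html><head><title>News and Views</title></head><body></body></html>"

m = match(r"<title>(.*?)</title>"s, html)
title = m === nothing ? "" : String(m.captures[1])
println(title)   # prints "News and Views"
```

This works for simple, predictable markup, but it breaks down quickly once attributes, nesting, or line breaks get involved.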

If it gets more complicated than that, regular expressions don't generalize, and you should use a full-fledged HTML parser. Gumbo.jl seems to be state of the art in Julia and has a rather simple interface.

In the latter case, it's unnecessary to split the document; in the former, splitting at least makes things more complicated, since then you have to think about line breaks. So, better parse first, then split.
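A short sketch of the parse-first approach with Gumbo.jl (assuming the package is installed; the HTML string is made up, and Gumbo normalizes documents so `<head>` and `<body>` always exist under the root):

```julia
using Gumbo

html = "<html><body><h3>Headline</h3></body></html>"

doc = parsehtml(html)    # parse into an HTMLDocument
root = doc.root          # the <html> element
body = root[2]           # child 1 is <head>, child 2 is <body>
h3 = body[1]             # the <h3> element
println(tag(h3))         # the element's tag name as a Symbol
```

Once you have the tree, you navigate elements and attributes instead of slicing strings, so line breaks in the source no longer matter.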

Specific elements can be extracted using the Cascadia library (git repo), for instance. Elements carrying a given class attribute in the HTML page can be matched via qs = eachmatch(Selector(".classID"), h.root), so that all elements with that class, such as <div class="classID">, get selected/extracted into the returned query result (qs).
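Putting the two packages together, a hedged sketch of selecting headlines and links (assuming Gumbo.jl and Cascadia.jl are installed; the HTML snippet and the "title" class name are made up for illustration, not taken from the Nature page):

```julia
using Gumbo, Cascadia

html = """<div class="article">
            <h3 class="title">Headline</h3>
            <a href="/news/1">Read more</a>
          </div>"""

doc = parsehtml(html)

# Select every element whose class is "title"
qs = eachmatch(Selector(".title"), doc.root)
headline = nodeText(qs[1])   # text content of the first match

# Collect the href attribute of every <a> element
links = [getattr(a, "href") for a in eachmatch(Selector("a"), doc.root)]
```

On the real site you would fetch the page with HTTP.get first, pass String(r.body) to parsehtml, and adapt the selectors to the classes the page actually uses (inspect them in your browser's developer tools).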

Upvotes: 6
