Reputation: 1
I'm very green when it comes to Python, but I see how powerful it is. I'd like to try a few things with it, but I'm pretty much teaching myself, so please feel free to explain things in their most basic terms. :/
I tried the goose extraction tool to pull some text from a URL and it worked pretty well. It was pretty simple...
from goose import Goose
url = 'http://example.com'
g = Goose()
article = g.extract(url=url)
article.cleaned_text
I'd like to replicate the process so I can extract text from hundreds of URLs. Is there a way to set this up so I can feed in a list of URLs, extract the text from each, and then (my guess) join the results together for NLP or whatever else I want to do? Thanks in advance...
Upvotes: 0
Views: 2629
Reputation: 2155
Simply put all the URLs in a text file, one per line:
http://example1.com
http://example2.com
http://example3.com
Then loop over that list:
from goose import Goose

# Read the list of hundreds of URLs from the file, skipping blank lines
with open("url_list.txt", "r") as f:
    url_list = [line.strip() for line in f if line.strip()]

g = Goose()
extracted = []

# Extract the text of each URL
for url in url_list:
    article = g.extract(url=url)
    # process/store the cleaned text for later use
    extracted.append(article.cleaned_text)
Later, once you have the text you need for analysis, do the storing and the processing in separate code blocks.
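For example, a minimal sketch of the storing step, assuming the extracted list built in the loop above (the file names are just placeholders):
import io

# Save each article's text to its own file for later processing
# (io.open lets us write UTF-8 text safely on both Python 2 and 3)
for i, text in enumerate(extracted):
    with io.open("article_{}.txt".format(i), "w", encoding="utf-8") as out:
        out.write(text)

# Or join everything into one string for NLP
corpus = "\n\n".join(extracted)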
Upvotes: 2
Reputation: 1133
Yes. You can either iterate over a list (a built-in Python object) of URLs, or read the URLs from a file:
Get URLs from a list:
from goose import Goose

list_of_urls = ['url1', 'url2', 'url1000']  # etc
g = Goose()

for url in list_of_urls:
    article = g.extract(url=url)
    text = article.cleaned_text
    # do more stuff with text
Read URLs from a file:
from goose import Goose

g = Goose()
with open(url_filename_here) as url_file:
    lines = url_file.readlines()

# each line should contain a different url
for line in lines:
    article = g.extract(url=line.strip())  # strip the trailing newline
    # do_more_stuff
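With hundreds of URLs, a few requests are bound to fail (dead links, timeouts), so it may be worth wrapping the extraction in a try/except so one bad URL doesn't stop the whole run. A minimal sketch, assuming you just want to skip failures:
for line in lines:
    try:
        article = g.extract(url=line.strip())
    except Exception:
        # skip URLs that fail to download or parse
        continue
    # do_more_stuff with article.cleaned_text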
Upvotes: 0