Reputation: 79
I need help with a text-cloud program I'm working on. I realize it's homework, but I've gotten pretty far on my own, only to be stumped for hours now. I'm stuck on the web-crawler part. The program is supposed to open a page, gather all the words from that page, and sort them by frequency. Then it's supposed to open any links on that page and get the words on those pages, and so on. The depth is controlled by a global variable DEPTH. In the end, it's supposed to put the words from all the pages together to form a text cloud.
I'm trying to use recursion to call a function that keeps opening links until the depth is reached. The import statement at the top is only to use a function called getHTML(URL), which returns a tuple containing the text of the page and a list of the links on the page.
Here is my code so far. Every function works as it should, except for getRecursiveURLs(url, DEPTH) and makeWords(i). I'm also not 100% sure about the counter(List) function at the bottom.
from hmc_urllib import getHTML

MAXWORDS = 50
DEPTH = 2
all_links = []

def getURL():
    """Asks the user for a URL"""
    URL = input('Please enter a URL: ')
    #all_links.append(URL)
    return makeListOfWords(URL), getRecursiveURLs(URL, DEPTH)
def getRecursiveURLs(url, DEPTH):
    """Opens up all links and adds them to the global all_links list,
    if they're not in all_links already"""
    s = getHTML(url)
    links = s[1]
    if DEPTH > 0:
        for i in links:
            getRecursiveURLs(i, DEPTH - 1)
            if i not in all_links:
                all_links.append(i)
                #print('This is all_links in the IF', all_links)
                makeWords(i)  #getRecursiveURLs(i, DEPTH - 1)
            #elif i in all_links:
            #    print('This is all_links in the ELIF', all_links)
            #    makeWords(i)  #getRecursiveURLs(i, DEPTH - 1)
    #print('All_links at the end', all_links)
    return all_links
def makeWords(i):
    """Take all_links and create a dictionary for each page.
    Then, create a final dictionary of all the words on all pages."""
    for i in all_links:
        FinalDict = makeListOfWords(i)
        #print(all_links)
        #makeListOfWords(i))
    return FinalDict
def makeListOfWords(URL):
    """Gets the text from a webpage and puts the words into a list"""
    text = getHTML(str(URL))
    L = text[0].split()
    return cleaner(L)
def cleaner(L):
    """Cleans the text of punctuation and removes words if they are in the stop list."""
    stopList = ['', 'a', 'i', 'the', 'and', 'an', 'in', 'with', 'for',
                'it', 'am', 'at', 'on', 'of', 'to', 'is', 'so', 'too',
                'my', 'but', 'are', 'very', 'here', 'even', 'from',
                'them', 'then', 'than', 'this', 'that', 'though']
    x = [dePunc(c) for c in L]
    for c in x:
        if c in stopList:
            x.remove(c)
    a = [stemmer(c) for c in x]
    return counter(a)
def dePunc(rawword):
    """de-punctuationifies the input string"""
    L = [c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z']
    word = ''.join(L)
    return word
def stemmer(word):
    """Stems the words"""
    # List of endings
    endings = ['ed', 'es', 's', 'ly', 'ing', 'er', 'ers']
    # This first case handles 3-letter suffixes WITH a doubled consonant, e.g. spammers -> spam
    if word[len(word)-3:len(word)] in endings and word[-4] == word[-5]:
        return word[0:len(word)-4]
    # This case handles 3-letter suffixes WITHOUT a doubled consonant, e.g. players -> play
    elif word[len(word)-3:len(word)] in endings and word[-4] != word[-5]:
        return word[0:len(word)-3]
    # This case handles 2-letter suffixes WITH a doubled consonant, e.g. spammed -> spam
    elif word[len(word)-2:len(word)] in endings and word[-3] == word[-4]:
        return word[0:len(word)-3]
    # This case handles 2-letter suffixes WITHOUT a doubled consonant, e.g. played -> play
    elif word[len(word)-2:len(word)] in endings and word[-3] != word[-4]:
        return word[0:len(word)-2]
    # If word not inflected, return as-is.
    else:
        return word
def counter(List):
    """Creates a dictionary of words and their frequencies, 'sorts' them,
    and prints them from most to least frequent"""
    freq = {}
    result = {}
    # Assign a frequency to each word
    for item in List:
        freq[item] = freq.get(item, 0) + 1
    # 'Sort' the dictionary by frequency
    for i in sorted(freq, key=freq.get, reverse=True):
        if len(result) < MAXWORDS:
            print(i, '(', freq[i], ')', sep='')
            result[i] = freq[i]
    return result
Upvotes: 1
Views: 1387
Reputation: 1141
It is not totally clear what the exact requirements of the assignment are, but from what I can gather you want to visit every page up to DEPTH once and only once, collect the words from all of those pages, and work with the aggregate result. The snippet below should be what you are looking for; however, it is untested (I do not have hmc_urllib). all_links, makeWords and makeListOfWords have been removed, but the rest of the code stays the same.
visited_links = []

def getURL():
    url = input('Please enter a URL: ')
    word_list = getRecursiveURLs(url, DEPTH)
    return cleaner(word_list)  # this prints the word count for all pages

def getRecursiveURLs(url, DEPTH):
    text, links = getHTML(url)
    visited_links.append(url)
    returned_word_list = text.split()
    #cleaner(text.split())  # this prints the word count for the current page
    if DEPTH > 0:
        for link in links:
            if link not in visited_links:
                returned_word_list += getRecursiveURLs(link, DEPTH - 1)
    return returned_word_list
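If you want to test the logic without hmc_urllib, a hypothetical stand-in for getHTML along these lines would do; the URLs and page text here are made up, and it only has to return the same (text, links) tuple shape your code expects:

def getHTML(url):
    """Hypothetical stand-in for hmc_urllib.getHTML, for local testing only.
    Returns (text, links): the page text as one string and a list of URLs."""
    fake_pages = {
        'http://example.com': ('spam egg egg spam', ['http://example.com/a']),
        'http://example.com/a': ('egg spam spam egg egg', []),
    }
    # Unknown URLs behave like an empty page with no outgoing links.
    return fake_pages.get(url, ('', []))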
Once you have a list of cleaned and stemmed words, you can use the following two functions to build the word-count dictionary and to print it sorted by frequency, respectively:
def counter(words):
    """
    Example Input:  ['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg']
    Example Output: {'spam': 3, 'egg': 5}
    """
    return dict((word, words.count(word)) for word in set(words))
def print_count(word_count, word_max):
    """
    Example Input: {'spam': 3, 'egg': 5}
    Prints the word list up to word_max, sorted by frequency
    """
    for word in sorted(word_count, key=word_count.get, reverse=True)[:word_max]:
        print(word, '(', word_count[word], ')', sep='')
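Tying it together, a driver could look roughly like this; it assumes you change your cleaner to return the cleaned word list (return a) instead of calling your old counter itself, so the new counter and print_count take over the counting and printing:

def main():
    url = input('Please enter a URL: ')
    word_list = getRecursiveURLs(url, DEPTH)  # raw words from every page visited
    word_count = counter(cleaner(word_list))  # clean and stem, then count
    print_count(word_count, MAXWORDS)         # show the top MAXWORDS by frequency

main()

As a side note, collections.Counter(words) from the standard library builds the same frequency mapping as counter in a single call.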
Upvotes: 2