Adrian Andronic

Reputation: 79

Web Crawler Text Cloud

I need help with a text cloud program I'm working on. I realize it's homework, but I've gotten pretty far on my own, only to be stumped for hours now. I'm stuck on the web crawler part. The program is supposed to open a page, gather all the words from that page, and sort them by frequency. Then it's supposed to open any links on that page and gather the words from those pages, and so on. The depth is controlled by a global variable DEPTH. In the end, it's supposed to put all the words from all pages together to form a text cloud.

I'm trying to use recursion to keep opening links until the depth is reached. The import statement at the top is only there for a function called getHTML(URL), which returns a tuple of the text on the page and a list of any links on the page.
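
In case it helps, here is how I understand the return value (the URL is made up, just to show the shape):

text, links = getHTML('http://example.com')
words = text.split()  # first element is the page text as one string
# second element is a list of the URL strings found on the page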

Here is my code so far. Every function works as it should, except for getRecursiveURLs(url, DEPTH) and makeWords(i). I'm also not 100% sure about the counter(List) function at the bottom.

from hmc_urllib import getHTML

MAXWORDS = 50
DEPTH = 2

all_links = []

def getURL():
    """Asks the user for a URL"""

    URL = input('Please enter a URL: ')

    #all_links.append(URL)

    return makeListOfWords(URL), getRecursiveURLs(URL, DEPTH)


def getRecursiveURLs(url, DEPTH):
    """Opens up all links and adds them to global all_links list,
    if they're not in all_links already"""

    s = getHTML(url)
    links = s[1]
    if DEPTH > 0:
        for i in links:
            getRecursiveURLs(i, DEPTH - 1)
            if i not in all_links:
                all_links.append(i)
                #print('This is all_links in the IF', all_links)
                makeWords(i)#getRecursiveURLs(i, DEPTH - 1)
            #elif i in all_links:

             #   print('This is all_links in the ELIF', all_links)
              #  makeWords(i) #getRecursiveURLs(i, DEPTH - 1)
    #print('All_links at the end', all_links)
    return all_links


def makeWords(i):
    """Take all_links and create a dictionary for each page.
    Then, create a final dictionary of all the words on all pages."""

    for i in all_links:
        FinalDict = makeListOfWords(i)
        #print(all_links)
        #makeListOfWords(i))
    return FinalDict


def makeListOfWords(URL):
    """Gets the text from a webpage and puts the words into a list"""

    text = getHTML(str(URL))
    L = text[0].split()
    return cleaner(L)


def cleaner(L):
    """Cleans the text of punctuation and removes words if they are in the stop list."""

    stopList = ['', 'a', 'i', 'the', 'and', 'an', 'in', 'with', 'for',
                'it', 'am', 'at', 'on', 'of', 'to', 'is', 'so', 'too',
                'my', 'but', 'are', 'very', 'here', 'even', 'from',
                'them', 'then', 'than', 'this', 'that', 'though']

    x = [dePunc(c) for c in L]

    # Use a list comprehension; calling x.remove() while looping over x skips elements
    x = [c for c in x if c not in stopList]

    a = [stemmer(c) for c in x]

    return counter(a)


def dePunc( rawword ):
    """ de-punctuationifies the input string """

    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ]
    word = ''.join(L)
    return word


def stemmer(word):
    """Stems the words"""

    # List of endings
    endings = ['ed', 'es', 's', 'ly', 'ing', 'er', 'ers']

    # This first case handles 3-letter suffixes WITH a doubled consonant, e.g. spammers -> spam
    if len(word) > 4 and word[-3:] in endings and word[-4] == word[-5]:
        return word[:-4]

    # This case handles 3-letter suffixes WITHOUT a doubled consonant, e.g. players -> play
    elif len(word) > 4 and word[-3:] in endings:
        return word[:-3]

    # This case handles 2-letter suffixes WITH a doubled consonant, e.g. spammed -> spam
    elif len(word) > 3 and word[-2:] in endings and word[-3] == word[-4]:
        return word[:-3]

    # This case handles 2-letter suffixes WITHOUT a doubled consonant, e.g. played -> play
    elif len(word) > 3 and word[-2:] in endings:
        return word[:-2]

    # If the word is not inflected, return it as-is.
    else:
        return word
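
# A few examples of what I expect stemmer() to produce:
#   stemmer('spammers') -> 'spam'
#   stemmer('players')  -> 'play'
#   stemmer('played')   -> 'play'
#   stemmer('cat')      -> 'cat'  (unchanged)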

def counter(List):
    """Creates dictionary of words and their frequencies, 'sorts' them,
    and prints them from most least frequent"""

    freq = {}
    result = {}
    # Assign frequency to each word
    for item in List:
        freq[item] = freq.get(item,0) + 1

    # 'Sort' the dictionary by frequency
    for i in sorted(freq, key=freq.get, reverse=True):
        if len(result) < MAXWORDS:
            print(i, '(', freq[i], ')', sep='')
            result[i] = freq[i]
    return result
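
For reference, this is what I expect counter() to do on a tiny test list (I haven't checked how ties in frequency get ordered):

counter(['ham', 'egg', 'egg'])
# prints:
# egg(2)
# ham(1)
# and returns {'egg': 2, 'ham': 1}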

Upvotes: 1

Views: 1387

Answers (1)

Jesse Harris

Reputation: 1141

The exact requirements of the assignment are not totally clear, but from what I can gather you want to visit every page down to DEPTH once and only once, gather the words from all of those pages, and work with the aggregate result. The snippet below should do what you are looking for, though it is untested (I do not have hmc_urllib). all_links, makeWords, and makeListOfWords have been removed; the rest of the code would stay the same.

visited_links = []

def getURL():
    url = input('Please enter a URL: ')
    word_list = getRecursiveURLs(url, DEPTH)
    return cleaner(word_list) # this prints the word count for all pages

def getRecursiveURLs(url, DEPTH):
    text, links = getHTML(url)
    visited_links.append(url)
    returned_word_list = text.split()
    #cleaner(text.split()) # this prints the word count for the current page

    if DEPTH > 0:
        for link in links:
            if link not in visited_links:
                returned_word_list += getRecursiveURLs(link, DEPTH - 1)
    return returned_word_list
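
One thing to be aware of: visited_links is a module-level list, so a second call to getURL() in the same session would skip every page seen on the first run. If that matters, reset it at the top of getURL():

def getURL():
    visited_links.clear()  # start fresh on each run
    url = input('Please enter a URL: ')
    word_list = getRecursiveURLs(url, DEPTH)
    return cleaner(word_list)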

Once you have a list of cleaned and stemmed words you can use the following functions to generate the word count dictionary and print the word count dictionary respectively:

def counter(words):
    """
    Example Input: ['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg']
    Example Output: {'spam': 3, 'egg': 5}
    """
    return dict((word, words.count(word)) for word in set(words))

def print_count(word_count, word_max):
    """
    Example Input: {'spam': 3, 'egg': 5}
    Prints the words up to word_max, sorted by frequency
    """
    for word in sorted(word_count, key=word_count.get, reverse=True)[:word_max]:
        print(word, '(', word_count[word], ')', sep='')
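
For example, as a quick sanity check:

word_count = counter(['spam', 'egg', 'egg', 'egg', 'spam', 'spam', 'egg', 'egg'])
print_count(word_count, MAXWORDS)
# egg(5)
# spam(3)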

Upvotes: 2
