Reputation: 11

Compression function for words.. python

This is supposed to be a compression function. We're supposed to read in a text file containing only words, and sort them by frequency and by the number of words. The words are in upper and lower case. I don't want an answer that solves it, I just want help.

for each word in the input list
    if the word is new to the list of unique words
        insert it into the list and set its frequency to 1
    otherwise
        increase its frequency by 1

sort unique word list by frequencies (function)

open input file again
open output file with appropriate name based on input filename (.cmp)
write out to the output file all the words in the unique words list

for each line in the file (delimited by newlines only!)
   split the line into words
   for each word in the line
     look up each word in the unique words list
       and write its location out to the output file
     don't output a newline character
   output a newline character after the line is finished

close both files
tell user compression is done

This is my next step:

def compression():

    for line in infile:
        words = line.split()    

def get_file():

    opened = False
    fn = input("enter a filename ")
    while not opened:
        try:
            infile = open(fn, "r")
            opened = True
        except OSError:
            print("Won't open")
            fn = input("enter a filename ")

    return infile

def compression():

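    # keep the file object that get_file() returns
    infile = get_file()

    data = infile.readlines()
    infile.close()

    words = []
    word_frequencies = []

    # the file is closed now, so loop over the lines already read
    for line in data:
        line_words = line.split()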

def main():

    # store the answer so it can be checked
    choice = input("Do you want to compress or decompress? Enter 'C' or 'D' ")
    if choice.upper() == "C":
        compression()

main()

Upvotes: 0

Answers (2)

Peter DeGlopper

Reputation: 37319

Thanks to the pseudocode I can more or less figure out what this should look like. I have to do some guesswork since you say you're restricted to what you've covered in class, and we have no way to know what that includes and what it doesn't.

I am assuming you can handle opening the input file and splitting it into words. That's pretty basic stuff - if your class is asking you to handle input files it must have covered it. If not, the main thing to look back to is that you can iterate over a file and get its lines:

with open('example.txt') as infile:
    for line in infile:
        words = line.split()

Now, for each word you need to keep track of two things - the word itself and its frequency. Your problem requires you to use lists to store your information. This means you have to either use two different lists, one storing the words and one storing their frequencies, or use one list that stores two facts for each of its positions. There are disadvantages either way - lists are not the best data structure to use for this, but your problem definition puts the better tools off limits for now.
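To make the second option concrete, here is a minimal sketch of the single-list variant, where each position holds a `[word, count]` pair. The function name `count_words_pairs` and the sample lines are made up for illustration, and the `for ... else` construct may or may not be something your class has covered.

```python
def count_words_pairs(lines):
    """Count words using one list of [word, count] pairs (hypothetical helper)."""
    pairs = []
    for line in lines:
        for word in line.split():
            # linear search through the pairs seen so far
            for pair in pairs:
                if pair[0] == word:
                    pair[1] += 1
                    break
            else:
                # the inner loop finished without break: the word is new
                pairs.append([word, 1])
    return pairs


# made-up sample input standing in for the lines of a file
sample_lines = ["the cat saw the dog", "the dog ran"]
pairs = count_words_pairs(sample_lines)
print(pairs)
```

The search for an existing word is an explicit loop here, which is the cost of keeping both facts in one list.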

I would probably use two different lists, one containing the words and one containing the frequency counts for the word in the same position in the word list. The advantage is that you can use the index method on the word list to find the list position of a given word, and then increment the matching frequency count. That will be much faster than using a for loop to search a list that stores both the word and its frequency count. The downside is that sorting is harder - you have to find a way to retain the list position of each frequency when you sort, or to combine the words and their frequencies before sorting. In this approach, you probably need to build a list that stores two pieces of information - the frequency count and then either the word list index or the word itself - and then sort that list by the frequency count.
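As a rough sketch of the "combine before sorting" idea, assuming the two lists have already been filled in and that "sorted by frequency" means most frequent first (the question doesn't say which direction). The sample data and the names `combined` and `sorted_words` are made up, and the `key=` argument with a lambda may be beyond what your class has covered.

```python
# made-up sample data: position i in both lists describes the same word
words = ["the", "cat", "saw", "dog", "ran"]
word_frequencies = [3, 1, 1, 2, 1]

# pair each count with its word so they stay together during the sort
combined = []
for position in range(len(words)):
    combined.append([word_frequencies[position], words[position]])

# sort by the count only, largest first; Python's sort is stable,
# so words with equal counts keep their first-seen order
combined.sort(key=lambda pair: pair[0], reverse=True)

# pull the words back out in their new order
sorted_words = [pair[1] for pair in combined]
print(sorted_words)
```

Storing the word itself rather than its index means the sorted result no longer depends on the original `words` list.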

I hope that part of the point of this exercise is to help drive home how useful the better data structures are when you're allowed to use them.

Edit: So the inner loop is going to look something like this:

words = []
word_frequencies = []
for line in infile:
    for word in line.split():
        try:
            word_position = words.index(word)
        except ValueError:
            # word is not in words yet
            words.append(word)
            # what do you think should happen to word_frequencies here?
        else:
            # now word_position is a number giving the position in both words and word_frequencies for the word
            # what do you think should happen to word_frequencies here?
            pass  # placeholder so the block parses until you fill it in

Upvotes: 1

Monkeyanator

Reputation: 1416

So I'm not going to do the entire assignment for you, but I can try my best to walk you through it step by step.

It would seem the first step is to create a list of all the words in the text file (assuming you know file reading methods). For this, you should look into Python's split method on strings (regular expressions can be used for more complex variations of this). So you need to store each word somewhere, and pair that value with the number of times it appears. Sounds like the job of a dictionary, to me. That should get you off on the right track.
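A minimal sketch of what the dictionary idea could look like, assuming dictionaries are allowed in your assignment (the other answer suggests they may not be). The function name `count_words` and the sample lines are made up for illustration.

```python
def count_words(lines):
    """Map each word to the number of times it appears (hypothetical helper)."""
    counts = {}
    for line in lines:
        for word in line.split():
            # get() returns 0 for a word that has not been seen yet
            counts[word] = counts.get(word, 0) + 1
    return counts


# made-up sample input standing in for the lines of a file
sample_lines = ["the cat saw the dog", "the dog ran"]
counts = count_words(sample_lines)
print(counts)
```

The lookup and the update are a single step here, so there is no second list to keep in sync.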

Upvotes: 1
