user5115661

How do I convert a text file into a list while deleting duplicate words and sorting the list in Python?

I'm trying to read lines in a file, split the lines into words, and add the individual words to a list if they are not already in the list. Lastly, the words have to be sorted. I've been trying to get this right for a while, and I understand the concepts, but I'm not sure how to get the exact language and placement right. Here's what I have:

filename = raw_input("Enter file name: ")
openedfile = open(filename)
lst = list()
for line in openedfile:
    line.rstrip()
    words = line.split()
for word in words:
    if word not in lst:
        lst.append(words)
print lst

Upvotes: 4

Views: 412

Answers (4)

ekhumoro

Reputation: 120808

If you're splitting the text file into words based on whitespace, just use split() on the whole thing. There's nothing to be gained by reading each line and stripping it, because split() already handles all that.

So to get the initial list of words, all you need is this:

filename = raw_input("Enter file name: ")
openedfile = open(filename)
wordlist = openedfile.read().split()

Then to remove duplicates, convert the word list to a set:

wordset = set(wordlist)

And finally sort it:

words = sorted(wordset)

This can all be simplified to three lines, like so:

filename = raw_input("Enter file name: ")
with open(filename) as stream:
    words = sorted(set(stream.read().split()))

(NB: the with statement will automatically close the file for you)
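As a rough check of the claim that split() makes the per-line stripping unnecessary, here is a minimal sketch that runs the same expression on an in-memory string instead of a file. The function name unique_sorted_words and the sample text are made up for illustration.

```python
def unique_sorted_words(text):
    """Return the distinct whitespace-separated words in text, sorted."""
    # split() with no arguments splits on any run of whitespace,
    # including newlines, and never yields empty strings.
    return sorted(set(text.split()))


# Sample text with blank lines, leading spaces and repeated words.
sample = "pear apple\n  apple   fig\n\npear\n"
result = unique_sorted_words(sample)
```

The same function could be given the result of stream.read() from the with block above.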

Upvotes: 1

adrianus

Reputation: 3199

Some thoughts:

  1. Consider using a set for the words. Adding a word to a set only changes the set when the word is not already in it.
  2. Open the file with 'r' to indicate read-only mode (this is also the default).
  3. Sorting is easy, and it works on sets too: just use sorted(), which returns a sorted list.

Something like this works:

filename = raw_input("Enter file name: ")

words = set()
with open(filename, 'r') as myfile:
    for line in myfile:
        # split() with no argument splits on any whitespace (spaces, tabs,
        # newlines) and never produces empty strings
        new_words = line.split()
        words.update(new_words)

print sorted(words)
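Point 1 can be seen in isolation with a small sketch (the sample words are invented): add() and update() silently ignore values the set already holds.

```python
words = set()
words.add("apple")
words.add("apple")                       # already present, set is unchanged
words.update(["fig", "apple", "pear"])   # only "fig" and "pear" are new

# sorted() accepts any iterable, including a set, and returns a new list.
ordered = sorted(words)
```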

Upvotes: 0

Anand S Kumar

Reputation: 91009

In the for word in words: loop, lst.append(words) appends the whole words list to lst. I believe you intended to use lst.append(word).

Also, the for word in words: loop should be indented inside for line in openedfile: so that it runs once for each line.

And lastly, if you want to lexicographically sort the words, you should call - lst.sort() at the end.

Also, it would be better to use a with statement to open the file, so that the file is closed automatically once everything is finished.

Also, it may be easier to use a set to store the words. The not in operator on a list is O(n), so each check gets slower as the list grows. With a set, you do not need to check whether the word is already present, because a set does not allow duplicates.

At the end, you can use list(..) to convert the set back to a list and then sort it.

Example -

filename = raw_input("Enter file name: ")
with open(filename) as openedfile:
    s = set()
    for line in openedfile:
        # split() already ignores leading/trailing whitespace, and update()
        # adds every word from the line at once, skipping duplicates
        s.update(line.split())
    lst = list(s)
    lst.sort()
    print lst
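To make the difference between the two append calls concrete, here is a minimal sketch with a made-up line of text. Appending words puts the whole list inside the result as one element, while appending word adds one string at a time.

```python
words = "the cat the hat".split()

nested = []
nested.append(words)      # the list itself becomes a single element

flat = []
for word in words:
    flat.append(word)     # one string per call
```

Here nested holds one item (a list of four strings), while flat holds the four strings themselves.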

Upvotes: 0

clay

Reputation: 1877

First, I think you want

lst.append(word)

so that a single word is appended only when it is not already in the list. You have lst.append(words), which appends the entire list of words from the line instead.

Second, to sort just use

lst.sort()
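Putting both fixes together, a minimal sketch of the list-based approach might look like the following. The function name sorted_unique and the sample lines are hypothetical; any iterable of lines, such as an open file, could be passed in. Note that the inner loop is nested inside the outer one, so it runs for every line.

```python
def sorted_unique(lines):
    """Collect each distinct word from an iterable of lines, then sort."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)   # append the single word, not the list
    lst.sort()                     # sorts in place and returns None
    return lst


result = sorted_unique(["b a\n", "a c\n"])
```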

Upvotes: 0
