user9442800

Reputation:

Need help deleting repeating lines in a txt file

I need the output to be a single list with no duplicates. The list I am using has around 100k emails, and many repeat 1000x. I want to remove those duplicates.

I have tried some solutions I found online,

but nothing is written to my new file and PyCharm just freezes when I run it:

def uniquelines(lineslist):
    unique = {}
    result = []
    for item in lineslist:
        if item.strip() in unique: continue
            unique[item.strip()] = 1
            result.append(item)
    return result

file1 = open("wordlist.txt","r")
filelines = file1.readlines()
file1.close()

output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()

I expect it to write all the emails, with none repeating, into a new text file.

Upvotes: 0

Views: 61

Answers (1)

Cohan

Reputation: 4544

Before I get into a few ways to hopefully solve the issue, one thing I see off the bat is that you are using both a dictionary and a list within your function. This almost doubles the memory you will need to process the file. I suggest using one or the other.

Using a set will provide you with a guaranteed "list" of unique items. The set.add() method simply ignores duplicates.

s = {1, 2, 3}
print(s)   # {1, 2, 3}
s.add(4)
print(s)   # {1, 2, 3, 4}
s.add(4)   # adding a duplicate is a no-op
print(s)   # {1, 2, 3, 4}

With that, you can modify your function to the following to achieve what you want. For my example, input.txt is a series of lines, each containing a single integer value, with plenty of duplicates.

def uniquelines(lineslist):
    unique = set()

    for line in lineslist:
        # add() silently ignores values already in the set
        unique.add(line.strip())

    return list(unique)

with open('input.txt', 'r') as f:
    lines = f.readlines()

output = uniquelines(lines)

with open('output.txt', 'w') as f:
    f.write("\n".join(output))

output.txt is as follows without any duplicates!

2
0
4
5
3
1
9
6
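
Note that the lines come out in a scrambled order, since sets are unordered. If you need to keep the lines in their original order of first appearance, a quick sketch (not from the approach above) is dict.fromkeys(), since dicts preserve insertion order in Python 3.7+; the filenames are just the same examples:

with open('input.txt', 'r') as f:
    lines = [line.strip() for line in f]

# dict keys are unique and keep insertion order, so this dedupes in order
ordered_unique = list(dict.fromkeys(lines))

with open('output.txt', 'w') as f:
    f.write("\n".join(ordered_unique))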

You can accomplish the same thing by calling set() on a list comprehension, but the disadvantage is that the comprehension builds the full list, duplicates and all, before the set pulls them out. The function above only ever holds the unique values, so depending on the size of your data, you probably want to use the function.

with open('input.txt', 'r') as f:
    lines = f.readlines()

output = set([l.strip() for l in lines])

with open('output.txt', 'w') as f:
    f.write("\n".join(output))
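
If memory is the real bottleneck (readlines() holds the whole file, and the comprehension copies it again), here is a minimal sketch that streams the file line by line instead, assuming the same example filenames:

# Sketch: never load the whole file; only the set of unique lines is kept
seen = set()

with open('input.txt', 'r') as src, open('output.txt', 'w') as dst:
    for line in src:
        stripped = line.strip()
        if stripped not in seen:
            seen.add(stripped)
            dst.write(stripped + "\n")

As a side effect, this also writes the lines in the order they first appear.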

I couldn't quite tell if you were looking to maintain a running count of how many times each unique line occurred. If that's what you're going for, you can use the in operator to check whether the line is already a key.

def uniquelines(lineslist):
    unique = {}

    for line in lineslist:
        line = line.strip()

        if line in unique:
            unique[line] += 1
        else:
            unique[line] = 1

    return unique

# {'9': 2, '0': 3, '4': 3, '1': 1, '3': 4, '2': 1, '6': 3, '5': 1}
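
For what it's worth, collections.Counter from the standard library does the same bookkeeping for you; a minimal sketch, assuming the same lineslist input:

from collections import Counter

# Counter builds the same {line: count} mapping in one pass
def uniquelines(lineslist):
    return Counter(line.strip() for line in lineslist)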

Upvotes: 1
