Khrystyna Pyurkovska

Reputation: 99

how to delete duplicate lines in a file in Python

I have a file with duplicate lines. What I want is to delete the duplicates so the file contains only unique lines, but I get an error: output.writelines(uniquelines(filelines)) TypeError: writelines() argument must be a sequence of strings. I have searched for the same issue but I still don't understand what is wrong. My code:

import codecs

def uniquelines(lineslist):
    unique = {}
    result = []
    for item in lineslist:
        if item.strip() in unique: continue
        unique[item.strip()] = 1
        result.append(item)
    return result
file1 = codecs.open('organizations.txt','r+','cp1251')
filelines = file1.readlines()
file1.close()
output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()

Upvotes: 3

Views: 6791

Answers (5)

Hello, I have another solution:

For this file:

01 WLXB64US
01 WLXB64US
02 WLWB64US
02 WLWB64US
03 WLXB67US
03 WLXB67US
04 WLWB67US
04 WLWB67US
05 WLXB93US
05 WLXB93US
06 WLWB93US
06 WLWB93US

Solution:

def deleteDuplicate():
    try:
        f = open('file.txt', 'r')
        datos = f.readlines()
        f.close()
        f = open('file.txt', 'w')
        i = 0
        while i < len(datos):
            # Write the line unless it matches the previous one
            # (comparison ignores whitespace)
            if i == 0:
                f.write(datos[i])
            elif datos[i - 1].strip().replace(' ', '') == datos[i].strip().replace(' ', ''):
                print('next...')
            else:
                f.write(datos[i])
            i = i + 1
        f.close()
    except Exception as err:
        print(err)

Upvotes: 0

Tritium21

Reputation: 2922

It is rather common in Python to remove duplicate objects from a sequence using a set. The only downside of a set is that you lose order (the same way you lose order with dictionary keys; in fact it is for the exact same reason, but that's not important). If the order of lines in your file matters, you can use the keys of an OrderedDict (in the standard library as of 2.7, I think) as a pseudo-set to remove duplicate strings from a sequence of strings. If order does not matter, use set() instead of collections.OrderedDict.fromkeys(). By opening the files in the modes 'rb' (read binary) and 'wb' (write binary), you stop having to worry about encoding: Python just treats the lines as bytes. This uses a with statement with multiple context managers, which was introduced after 2.5, so you may need to adjust it as needed if it is a syntax error for you.

import collections

with open(infile, 'rb') as inf, open(outfile, 'wb') as outf:
    outf.writelines(collections.OrderedDict.fromkeys(inf))
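
For illustration, a small in-memory sketch (the sample lines are hypothetical) showing that OrderedDict.fromkeys keeps first-seen order while dropping repeats:

```python
import collections

# Hypothetical sample lines containing duplicates
lines = ["b\n", "a\n", "b\n", "c\n", "a\n"]

# fromkeys keeps the first occurrence of each key, in insertion order
deduped = list(collections.OrderedDict.fromkeys(lines))
print(deduped)  # ['b\n', 'a\n', 'c\n']
```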

Upvotes: 0

Hutch

Reputation: 10694

I wouldn't bother encoding or decoding at all. Simply open with open('organizations.txt', 'rb') as well as open('wordlist_unique.txt', 'wb') and you should be fine.
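
A minimal sketch of that suggestion, wrapped in a function (the function name is illustrative, and the filenames are the ones from the question):

```python
def dedupe_file(src, dst):
    """Copy src to dst, keeping only the first occurrence of each line."""
    # Binary mode sidesteps encoding entirely: lines stay raw bytes
    with open(src, 'rb') as infile:
        lines = infile.readlines()
    seen = set()
    with open(dst, 'wb') as outfile:
        for line in lines:
            key = line.strip()
            if key not in seen:
                seen.add(key)
                outfile.write(line)

# Usage, with the question's filenames:
# dedupe_file('organizations.txt', 'wordlist_unique.txt')
```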

Upvotes: 1

jocke-l

Reputation: 703

If you don't need the lines in order afterwards, I suggest putting the strings in a set: set(linelist). The line order will be scrambled, but the duplicates will be gone.
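
A quick sketch of the set approach, on a hypothetical list of lines (sorting afterwards gives a stable, if reordered, output):

```python
lines = ["b\n", "a\n", "b\n", "c\n"]  # hypothetical sample

# A set drops duplicates but does not preserve order
unique = set(lines)

# Sort for a deterministic result before writing back out
print(sorted(unique))  # ['a\n', 'b\n', 'c\n']
```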

Upvotes: 0

falsetru

Reputation: 368904

The code uses two different open functions: codecs.open when it reads, and the built-in open when it writes.

readlines on a file object created with codecs.open returns a list of unicode strings, while writelines on a file object created with the built-in open expects a sequence of (byte) strings.

Replace following lines:

output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()

with:

output = codecs.open("wordlist_unique.txt", "w", "cp1251")
output.writelines(uniquelines(filelines))
output.close()

or preferably (using with statement):

with codecs.open("wordlist_unique.txt", "w", "cp1251") as output:
    output.writelines(uniquelines(filelines))

Upvotes: 3
