smous

Reputation:

How might I remove duplicate lines from a file?

I have a file with one column. How can I delete repeated lines from the file?

Upvotes: 49

Views: 151188

Answers (15)

Karree

Reputation: 7

Here is my solution

d = input("your file:")  # write your file name here
file1 = open(d, mode="r")
file2 = open('file2.txt', mode='w')  # create/empty the output file first
file2.close()

file1row = file1.readline()

while file1row != "":
    file2 = open('file2.txt', mode='a')
    file2read = open('file2.txt', mode='r')
    # compare against whole lines already written, not substrings
    if file1row not in file2read.readlines():
        file2.write(file1row)
    file1row = file1.readline()
    file2read.close()
    file2.close()

file1.close()

Upvotes: 0

marcell

Reputation: 1402

uniqlines = set(open('/tmp/foo').readlines())

this will give you the set of unique lines.

writing that back to some file would be as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)
bar.close()

Upvotes: 28

shahjapan

Reputation: 14335

Get all your lines into a list, make a set of the lines, and you are done. For example:

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

If you need to preserve the ordering of the lines (a set is an unordered collection), try this:

y = []
for l in x:
    if l not in y:
        y.append(l)

and write the content back to the file.
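A minimal sketch of that last step ('file.txt' is just a placeholder name); the elements of y in the example above have no trailing newlines, so join them explicitly:

with open('file.txt', 'w') as f:
    f.write('\n'.join(y) + '\n')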

Upvotes: 8

Ashwaq

Reputation: 459

cat <filename> | grep -E '^[a-zA-Z]+$' | sort -u > outfile.txt

This keeps only purely alphabetic lines and removes duplicate values from the file (note that grep needs -E for the + quantifier).

Upvotes: 0

Ravgeet Dhillon

Reputation: 590

Readable and Concise

with open('sample.txt') as fl:
    content = fl.read().split('\n')

# drop blank lines and duplicates (note: a set does not preserve the original order)
content = set(line for line in content if line != '')

content = '\n'.join(content)

with open('sample.txt', 'w') as fl:
    fl.write(content)

Upvotes: 2

hamed alholi

Reputation: 69

Edit it within the same file:

lines_seen = set() # holds lines already seen

with open("file.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i not in lines_seen:
            f.write(i)
            lines_seen.add(i)
    f.truncate()

Upvotes: 2

Torkoal

Reputation: 487

If anyone is looking for a solution that uses hashing and is a little more flashy, this is what I currently use:

import os

def remove_duplicate_lines(input_path, output_path):

    if os.path.isfile(output_path):
        raise OSError('File at {} (output file location) exists.'.format(output_path))

    with open(input_path, 'r') as input_file, open(output_path, 'w') as output_file:
        seen_lines = set()

        def add_line(line):
            seen_lines.add(line)
            return line

        output_file.writelines((add_line(line) for line in input_file
                                if line not in seen_lines))
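
For example, a call might look like this (the file names are just placeholders):

remove_duplicate_lines('input.txt', 'deduped.txt')  # raises OSError if deduped.txt already exists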

Upvotes: 2

David Duluc

Reputation: 31

Look at this script I created to remove duplicate emails from text files. Hope this helps!

# function to remove duplicate emails
def remove_duplicate():
    # open emails.txt and read it as one long string
    with open('emails.txt', 'r') as f:
        emails = f.read()
    # .split() breaks the string on whitespace and returns a list
    emails = emails.split()
    # empty list to store non-duplicate e-mails
    clean_list = []
    # for loop to append non-duplicate emails to clean list
    for email in emails:
        if email not in clean_list:
            clean_list.append(email)
    return clean_list

# assigns no_duplicate_emails.txt to variable below
no_duplicate_emails = open('no_duplicate_emails.txt', 'w')

# write each de-duplicated email to the output file
for email in remove_duplicate():
    # .strip() method to remove commas
    email = email.strip(',')
    no_duplicate_emails.write(f"E-mail: {email}\n")
# close no_duplicate_emails.txt file
no_duplicate_emails.close()

Upvotes: 3

All Іѕ Vаиітy

Reputation: 26422

Adding to @David Locke's answer: on *nix systems you can run

sort -u messy_file.txt > clean_file.txt

which will create clean_file.txt with duplicates removed and the lines in alphabetical order.
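
If you want to run that from inside Python rather than the shell, here is a minimal sketch using the standard subprocess module (the file names are just placeholders):

import subprocess

# equivalent of: sort -u messy_file.txt > clean_file.txt
with open('clean_file.txt', 'w') as out:
    subprocess.run(['sort', '-u', 'messy_file.txt'], stdout=out, check=True)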

Upvotes: 4

Arthur M

Reputation: 458

It's a rehash of what's already been said here; here's what I use.

import optparse

def removeDups(inputfile, outputfile):
    lines = open(inputfile, 'r').readlines()
    lines_set = set(lines)
    out = open(outputfile, 'w')
    for line in lines_set:
        out.write(line)
    out.close()

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if (inputfile is None) or (outputfile is None):
        print(parser.usage)
        exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()
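
Assuming the script is saved as, say, dedup.py (a hypothetical name), it would be invoked like this:

python dedup.py -i input.txt -o output.txt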

Upvotes: 6

MLSC

Reputation: 5972

You can do:

import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")

Here you are calling awk from within Python :)

You also have another way:

with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())
    with open('/tmp/rmdup.txt', 'w') as rmdup:
        rmdup.writelines(uniqlines)

Upvotes: 8

Rahul Patil

Reputation: 1034

Python one-liner:

python -c "import sys; lines = sys.stdin.readlines(); print(''.join(sorted(set(lines))))" < InputFile > OutputFile

Upvotes: 4

Unknown92

Reputation: 41

Here is my solution

if __name__ == '__main__':
    f = open('temp.txt', 'w+')
    flag = False
    with open('file.txt') as fp:
        for line in fp:
            for temp in f:
                if temp == line:
                    flag = True
                    print('Found Match')
                    break
            if flag == False:
                f.write(line)
            elif flag == True:
                flag = False
            f.seek(0)
        f.close()

Upvotes: 1

Vinay Sajip

Reputation: 99355

On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

lines_seen = set() # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen: # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines, but just drop duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop, do outfile.writelines(sorted(lines_seen)).
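
That sorted variant would look like this (a minimal sketch reusing the same infilename and outfilename):

lines_seen = set() # holds lines already seen
for line in open(infilename, "r"):
    lines_seen.add(line)

outfile = open(outfilename, "w")
outfile.writelines(sorted(lines_seen)) # duplicates dropped, output sorted
outfile.close()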

Upvotes: 86

David Locke

Reputation: 18074

If you're on *nix, try running the following command:

sort <file name> | uniq
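
To save the result, redirect the output to a file (deduped.txt is just a placeholder name):

sort <file name> | uniq > deduped.txt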

Upvotes: 49
