Reputation: 79

Finding identical lines over multiple files

I know there are a lot of similar questions, but none of the solutions seem to work.

Let me begin:

I have two files, one called client-gen.txt, and another called server-gen.txt. The files contain randomly generated SHA1 strings, for example:

902ba3cda1883801594b6e1b452790cc53948fda
356a192b7913b04c54574d18c28d46e6395428ab
c1dfd96eea8cc2b62785275bca38ac261256e278
1b6453892473a467d07372d45eb05abc2031647a
77de68daecd823babbb58edb1c8e14d7106e83bb

The other file looks like this:

77de68daecd823babbb58edb1c8e14d7106e83bb
da4b9237bacccdf19c0760cab7aec4a8359010b0
356a192b7913b04c54574d18c28d46e6395428ab
1b6453892473a467d07372d45eb05abc2031647a
356a192b7913b04c54574d18c28d46e6395428ab
ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4
da4b9237bacccdf19c0760cab7aec4a8359010b0
356a192b7913b04c54574d18c28d46e6395428ab
1b6453892473a467d07372d45eb05abc2031647a
da4b9237bacccdf19c0760cab7aec4a8359010b0

How can I compare these files and print only the matching lines? In this case:

77de68daecd823babbb58edb1c8e14d7106e83bb
1b6453892473a467d07372d45eb05abc2031647a
1b6453892473a467d07372d45eb05abc2031647a

The order is not important.

FYI, I have already tried using set() and other methods. None of them seem to work.

If you can help, I would really appreciate it.

Upvotes: 0

Answers (4)

Padraic Cunningham

Reputation: 180411

I presume your expected output is incorrect: it omits '356a192b7913b04c54574d18c28d46e6395428ab', which appears in both files (and three times in your second file). If you want the elements that appear in both files, use set.intersection:

with open("a.txt") as a, open("b.txt") as b:
    st = set(map(str.rstrip,a))
    print("\n".join(st.intersection(map(str.rstrip,b))))

Output (a set is unordered, so the lines may print in a different order):


356a192b7913b04c54574d18c28d46e6395428ab
1b6453892473a467d07372d45eb05abc2031647a
77de68daecd823babbb58edb1c8e14d7106e83bb

Upvotes: 1

Cody Piersall

Reputation: 8547

You can use a Counter and print only the items with a count of 2. An open file is iterable (a for loop over it yields its lines), so you can build the collections directly from the open files. Wrapping each file in set() first ensures that a line repeated within one file is counted only once:

from collections import Counter
with open('file1') as file1, open('file2') as file2:
    ids = Counter(set(file1))  # count each distinct line once per file
    ids.update(set(file2))
for key, value in ids.items():
    if value > 1:  # seen in both files
        print(key)

This method keeps the trailing newline on each line, which is probably not what you want: print adds a second newline, and a final line that lacks a newline will not match its counterpart. In that case, iterate over the files explicitly and strip the whitespace:

from collections import Counter
with open('file1') as file1, open('file2') as file2:
    ids = Counter()
    # Strip whitespace; the set keeps repeats within one file from inflating the count
    ids.update({line.strip() for line in file1})
    ids.update({line.strip() for line in file2})

for key, value in ids.items():
    if value > 1:  # present in both files
        print(key)

Upvotes: 1

iced

Reputation: 1572

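with open("client-gen.txt") as c, open("server-gen.txt") as s:
    cl = [l.strip() for l in c]
    sl = {l.strip() for l in s}  # a set makes each membership test fast
common = [l for l in cl if l in sl]  # client lines that also occur in the server file
print("\n".join(common))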

Upvotes: 0

user1196549

Reputation:

Sort both files in alphabetical order. Then, in a single merge-like pass, you will find all the duplicates.
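
The sort-then-merge idea can be sketched roughly as follows. This is only one possible reading of it: the function names common_sorted and common_lines are made up for illustration, the files are assumed to fit in memory, and each shared line is reported once even if it is repeated in a file.

```python
def common_sorted(a, b):
    """Return the values found in both sorted lists, each reported once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] is too small to match anything left in b
        elif a[i] > b[j]:
            j += 1          # b[j] is too small to match anything left in a
        else:
            # Match: record it unless it repeats the previous match
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out


def common_lines(path_a, path_b):
    """Hypothetical helper: read, strip, and sort both files, then merge."""
    with open(path_a) as fa, open(path_b) as fb:
        a = sorted(line.strip() for line in fa)
        b = sorted(line.strip() for line in fb)
    return common_sorted(a, b)
```

Calling common_lines("client-gen.txt", "server-gen.txt") would return the shared hashes in sorted order. The merge itself is a single linear pass; the sorting dominates the cost, so for files that fit in memory a set intersection is usually simpler.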

Upvotes: 1
