user2696258

Reputation: 1189

Checking whether a set of strings in one file appear in another file, using Python or bash

Hi, I want to check whether the words (alphanumeric) in one file also appear in another file that contains its own set of words.

For example, I have a file f1.txt (20K in size):

w1
w2
w3
w4
.. //more ids like this

And another file, f2.txt (120K in size):

q1
q2
q3
q4
q5
q6
q7
q8
w2

So I want to check how many, and which, ids from f1.txt are present in f2.txt.

I want the output to be like:

1
w2

I know this is easy to do with loops, but I would like to know whether it can be done with bash scripting, using grep and the like, since that is fast and I mainly want to analyze the data. A Python solution would also do.

Any leads appreciated.

Upvotes: 1

Views: 74

Answers (2)

Ibrahim

Reputation: 297

You can use

list.count(x)

Return the number of times x appears in the list.

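# Read both files into lists of words, one word per line
with open("f1.txt") as f1, open("f2.txt") as f2:
    f1_lines = [line.strip("\n") for line in f1]
    f2_lines = [line.strip("\n") for line in f2]

# For each word in f1.txt, print how many times it appears in f2.txt
for w in f1_lines:
    print(w, f2_lines.count(w))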
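
The loop above scans the whole f2 list once per word, which is fine at these file sizes. If only the number of matches and the matching ids are needed, in the format the question asks for, a set intersection is one possible alternative. This is a minimal sketch, assuming one id per line; the function name common_ids is hypothetical and the sample data only mirrors the question:

```python
def common_ids(f1_lines, f2_lines):
    """Return the sorted ids that occur in both sequences of lines."""
    # Strip whitespace and skip blank lines (an assumption about the input)
    ids1 = {line.strip() for line in f1_lines if line.strip()}
    ids2 = {line.strip() for line in f2_lines if line.strip()}
    return sorted(ids1 & ids2)


if __name__ == "__main__":
    # Sample data mirroring the question; with real files,
    # pass open("f1.txt") and open("f2.txt") instead of these lists.
    f1_sample = ["w1\n", "w2\n", "w3\n", "w4\n"]
    f2_sample = ["q1\n", "q2\n", "q3\n", "w2\n"]
    matches = common_ids(f1_sample, f2_sample)
    print(len(matches))         # how many ids are shared
    print("\n".join(matches))   # which ids are shared
```

Note that the sets drop duplicates, so this reports each shared id once rather than counting how often it occurs in f2.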

Upvotes: 1

Kent

Reputation: 195079

Since the files are not that big, we can load the words from f1 into memory (an awk hashtable) and compare f2 against them:

awk 'NR==FNR{a[$0];next}$0 in a{a[$0]++}
  END{for(x in a)if(a[x])print x, a[x]}' f1 f2

It outputs:

w2 1

(The output is just an example; the output format can easily be adjusted.)

awk                    # the awk command
'NR==FNR{a[$0];next}   # first file (f1): store each word as a key in hashtable a (value starts empty, i.e. 0)
$0 in a{a[$0]++}       # second file (f2): if the word is a key in a, increment its count
END{                   # after both files are processed, print the results
   for(x in a)         # go through the hashtable
    if(a[x])           # if the value is > 0 (the word appears in f2)
     print x, a[x]}'   # print the word (key) and how many times it appears (value)
f1 f2                  # the two input files
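
For readers who prefer Python, the same hashtable idea might be sketched roughly as follows. This is not part of the awk answer, only an approximate counterpart: the function name count_hits is hypothetical, and unlike awk's $0 it strips surrounding whitespace from each line, which is an assumption about the input.

```python
def count_hits(f1_lines, f2_lines):
    """Count how often each word from f1 occurs in f2 (mirrors the awk logic)."""
    # First file: every word becomes a key with a count of 0
    counts = {line.strip(): 0 for line in f1_lines if line.strip()}
    # Second file: increment the count when the word is a known key
    for line in f2_lines:
        word = line.strip()
        if word in counts:
            counts[word] += 1
    # Keep only the words that actually appeared in f2
    return {word: n for word, n in counts.items() if n}


if __name__ == "__main__":
    # Sample data mirroring the question; real files could be passed
    # as open("f1.txt") and open("f2.txt").
    hits = count_hits(["w1\n", "w2\n", "w3\n"], ["q1\n", "q2\n", "w2\n"])
    for word, n in hits.items():
        print(word, n)
```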

Upvotes: 3
