Reputation: 1189
Hi, I want to check which of the words (alphanumeric ids) listed in one file also appear in another file of words.
For example, I have a file f1.txt (20K in size):
w1
w2
w3
w4
.. //more ids like this
and another file f2.txt (120K in size):
q1
q2
q3
q4
q5
q6
q7
q8
w2
So I want to check how many, and which, ids from f1.txt are present in f2.txt.
I want the output to look like this:
1
w2
I know this is easy and can be done with loops, but I want to know whether it can be done with bash scripting and tools like grep, since that would be fast; I mainly want to analyze the data. Python would also do.
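For example, I am wondering whether something along the lines of this sketch would do it with grep (just a guess on my part; it assumes one id per line in both files and exact whole-line matches):
# which ids from f1.txt appear as whole lines in f2.txt (each listed once)
grep -Fxf f1.txt f2.txt | sort -u
# how many of them
grep -Fxf f1.txt f2.txt | sort -u | wc -l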
Any leads appreciated.
Upvotes: 1
Views: 74
Reputation: 297
You can use count().
From the Python docs for str.count(sub[, start[, end]]):
Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.
list.count(x) works the same way for lists (returning the number of times x appears in the list), which is what the snippet below uses:
# read both files, then count how many times each word from f1 appears in f2
with open("f1.txt") as f1, open("f2.txt") as f2:
    f1_lines = [line.strip("\n") for line in f1.readlines()]
    f2_lines = [line.strip("\n") for line in f2.readlines()]
for w in f1_lines:
    print(w, f2_lines.count(w))
Upvotes: 1
Reputation: 195079
Since the files are not that big, we can put the first one in memory (an awk hashtable) and compare:
awk 'NR==FNR{a[$0];next}$0 in a{a[$0]++}
END{for(x in a)if(a[x])print x, a[x]}' f1 f2
It outputs:
w2 1
(The output above is just an example; the format can easily be adjusted.)
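For instance, here is a minimal variant (just a sketch, assuming exact whole-line matches and that each matching id should be listed only once) that prints the count first and then the matching ids, as asked for in the question:
awk 'NR==FNR{a[$0];next}                    # load the ids from f1 into hashtable a
     ($0 in a) && !seen[$0]++{hit[n++]=$0}  # record each f1 id found in f2, once
     END{print n+0;                         # how many ids matched
         for(i=0;i<n;i++)print hit[i]       # which ids matched
     }' f1 f2
For the sample files above this prints 1 followed by w2.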
awk                      # the awk command
'NR==FNR{a[$0];next}     # read the first file (f1): store each word as a key in hashtable a
$0 in a{a[$0]++}         # read the second file (f2): if the word is a key in a, increment its count
END{                     # after both files have been processed, print the results
for(x in a)              # go through the hashtable
if(a[x])                 # if the value is > 0 (i.e. the word appears in f2)
print x, a[x]}'          # print which word (the key) and how many times (the value)
f1 f2                    # the two input files
Upvotes: 3