Reputation: 1704
I am looking for something like this, but instead of counting the number of duplicated lines I need to count the number of duplicated bunches of lines.
For the sake of clarification, I have a file like this:
Separator
line11
line12
line13
Separator
line21
line22
line23
Separator
line11
line12
line13
Separator
line11
line12
line13
Separator
line31
line32
line33
Separator
line21
line22
line23
And I would expect an output as follows:
3: Separator
line11
line12
line13
2: Separator
line21
line22
line23
1: Separator
line31
line32
line33
Where 3:, 2: and 1: mean the number of times each bunch of lines appears in the file.
I tried the following command, without success:
sort all_lits.txt | uniq -c
Currently I am writing an awk command to obtain this information, but I have nothing clear yet. As soon as I have a command worth showing, I will post it.
Is it possible to get this information using some combination of UNIX tools such as awk, grep, wc, sort, etc.?
I know I could write a script to do it, but I would like to avoid that. As a last resort I will.
Any help would be highly appreciated.
Upvotes: 0
Views: 121
Reputation: 246724
awk -v RS=Separator '
NR>1 {count[$0]++}
END {for (bunch in count) print count[bunch], RS, bunch}
' file
1 Separator
line31
line32
line33
2 Separator
line21
line22
line23
3 Separator
line11
line12
line13
There is no inherent order to the output. If you want it sorted by count, descending, and you're using GNU awk:
awk -v RS=Separator '
NR>1 {count[$0]++}
END {
PROCINFO["sorted_in"] = "@val_num_desc"
for (bunch in count) print count[bunch], RS, bunch
}
' file
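PROCINFO["sorted_in"] is a gawk extension that controls the order in which for (bunch in count) traverses the array; "@val_num_desc" visits the elements by numeric value, largest first. With the sample input above, this should print the same bunches, highest count first:
3 Separator
line11
line12
line13
2 Separator
line21
line22
line23
1 Separator
line31
line32
line33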
Upvotes: 2
Reputation: 1704
This is the script I am using. It is still being tested, but it may serve as a base for others:
with open(file_name, mode="r") as bigfile:   # file_name is the path to the input file
    reader = bigfile.read()

# Count how many times each bunch (the text between 'Separator' keywords) appears
d = dict()
for res in reader.split('Separator'):
    if res in d:
        d[res] = d[res] + 1
    else:
        d[res] = 1

for k in d:
    print str(k) + ':' + str(d[k])
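For reference, a roughly equivalent Python 3 sketch (just a sketch, assuming the same 'Separator' keyword, that the empty chunk before the first separator should be ignored, and that the input file is the all_lits.txt mentioned in the question):

from collections import Counter

file_name = 'all_lits.txt'  # input file; adjust as needed

with open(file_name) as bigfile:
    chunks = bigfile.read().split('Separator')

# Count each non-empty bunch of lines
counts = Counter(chunk for chunk in chunks if chunk.strip())

# Print the most frequent bunches first, in the "count: Separator ..." format
for chunk, n in counts.most_common():
    print('{}: Separator{}'.format(n, chunk.rstrip('\n')))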
Upvotes: 1