pafede2

Reputation: 1704

Linux command or script for counting duplicated blocks of lines in a text file?

I am looking for something like this, but instead of counting duplicated lines I need to count duplicated blocks of lines.

For the sake of clarification, I have a file like this:

Separator
line11
line12
line13
Separator
line21
line22
line23
Separator
line11
line12
line13
Separator
line11
line12
line13
Separator
line31
line32
line33
Separator
line21
line22
line23

And I would expect output like this:

3:    Separator
      line11
      line12
      line13
2:    Separator
      line21
      line22
      line23
1:    Separator
      line31
      line32
      line33

Here 3:, 2: and 1: are the number of times each block of lines appears in the file.

I tried the following command without success:

sort all_lits.txt | uniq -c

I am currently writing an awk command to get this information, but I have nothing working yet. I will post it as soon as I have something to show.

Is it possible to get this information using some combination of UNIX tools such as awk, grep, wc, sort, etc.?

I know I could write a script to do it, but I would prefer to avoid that and will only do so as a last resort.

Any help would be highly appreciated.

Upvotes: 0

Views: 121

Answers (2)

glenn jackman

Reputation: 246724

awk -v RS=Separator '
    NR>1 {count[$0]++}
    END {for (bunch in count) print count[bunch], RS, bunch}
' file

Output:

1 Separator 
line31
line32
line33

2 Separator 
line21
line22
line23

3 Separator 
line11
line12
line13

The output order is unspecified, because awk's for-in loop visits array elements in no particular order. If you want the output sorted by count in descending order and you are using GNU awk:

awk -v RS=Separator '
    NR>1 {count[$0]++}
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (bunch in count) print count[bunch], RS, bunch
    }
' file
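
For readers without GNU awk, the same count-then-sort idea can be sketched in Python. This is only an illustration, not part of the awk solution: count_blocks and sample are hypothetical names, and the sketch assumes every separator is a line containing exactly the word Separator.

```python
from collections import Counter


def count_blocks(text, separator="Separator"):
    """Count identical blocks of lines; a block is the lines after a separator line."""
    counts = Counter()
    block = None  # stays None until the first separator line is seen
    for line in text.splitlines():
        if line.strip() == separator:
            if block is not None:
                counts[tuple(block)] += 1  # close the previous block
            block = []
        elif block is not None:
            block.append(line)
    if block is not None:
        counts[tuple(block)] += 1  # flush the last block
    return counts


# Rebuild the sample file from the question: blocks 1, 2, 1, 1, 3, 2.
sample = "".join(
    f"Separator\nline{b}1\nline{b}2\nline{b}3\n" for b in (1, 2, 1, 1, 3, 2)
)

# most_common() yields the blocks sorted by count, highest first.
for block, n in count_blocks(sample).most_common():
    print(n, "Separator", *block)
```

Matching whole lines instead of splitting on the separator text means a data line that merely contains the word Separator is not treated as a boundary.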

Upvotes: 2

pafede2

Reputation: 1704

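This is the script I am using. It is still being tested, but it may serve as a starting point for other people:

file_name = "all_lits.txt"  # input file from the question

with open(file_name, mode="r") as bigfile:
    reader = bigfile.read()

# Count how many times each block between separators occurs.
d = dict()
for res in reader.split('Separator'):
    if not res.strip():
        continue  # skip the empty chunk before the first separator
    d[res] = d.get(res, 0) + 1

for k in d:
    print(str(k) + ':' + str(d[k]))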
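
To get closer to the layout requested in the question (count first, block lines indented, most frequent block first), the same split-and-count idea could be wrapped as follows. This is a minimal sketch under assumptions: report and sample are hypothetical names, and the six-character column width is read off the expected output in the question.

```python
from collections import Counter


def report(text, separator="Separator"):
    """Return the block counts laid out like the expected output in the question."""
    # Split on the separator text and trim the newlines around each chunk.
    chunks = [chunk.strip("\n") for chunk in text.split(separator)]
    # Ignore empty chunks, such as the one before the first separator.
    counts = Counter(chunk for chunk in chunks if chunk)
    lines = []
    for chunk, n in counts.most_common():  # highest count first
        label = f"{n}:"
        lines.append(f"{label:<6}{separator}")  # count padded to six columns
        lines.extend(" " * 6 + line for line in chunk.split("\n"))
    return "\n".join(lines)


# Rebuild the sample file from the question: blocks 1, 2, 1, 1, 3, 2.
sample = "".join(
    f"Separator\nline{b}1\nline{b}2\nline{b}3\n" for b in (1, 2, 1, 1, 3, 2)
)
print(report(sample))
```

Stripping the newlines around each chunk also makes the last block compare equal to earlier copies when the file has no trailing newline.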

Upvotes: 1
