Reputation: 1
I have a huge text file (~4.5 GB) that holds ~48 million lines. All lines are in the following syntax:
country01/city01/street01/building01
country01/city01/street01/building02
country01/city01/street02/building01
country01/city01/street02/building02
country01/city02/street01/building01
.
.
etc...
I'm trying to find a quick way to extract the street names and the number of buildings each street holds.
I tried various combinations of sed
and awk
together with wc -l
but it gets messy and I'm definitely missing something.
Any help would be appreciated!
Upvotes: 0
Views: 59
Reputation: 26571
If you just need to know the number of buildings on a street, you can do the following:
$ cut -d'/' -f-3 file | sort | uniq -c
This will give you a sorted list of streets with the building count next to each:
2 country01/city01/street01
2 country01/city01/street02
1 country01/city02/street01
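If you want the busiest streets first (assuming that ordering is useful to you), tack a numeric sort onto the end:
$ cut -d'/' -f-3 file | sort | uniq -c | sort -rn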
If there might be duplicate lines in your list, you can deduplicate first:
$ sort -u file | cut -d'/' -f-3 | uniq -c
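For example, with a duplicated building line (shortened hypothetical names for brevity), the duplicate is only counted once:
$ printf 'c01/ci01/st01/b01\nc01/ci01/st01/b01\n' | sort -u | cut -d'/' -f-3 | uniq -c
      1 c01/ci01/st01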
If you really have an enormous file that might not fit into memory and sort
takes too long, you can do the following instead:
$ awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file
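Here FS and SUBSEP are both set to /, so the array key $1,$2,$3 prints back as country/city/street. Since for (i in a) iterates in no particular order, you can pipe the (now much smaller) output through sort if you want it ordered by count:
$ awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file | sort -rn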
or if you might have duplicates:
$ awk '($0 in a){next}{print; a[$0]}' file | awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}'
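As a sketch of the same idea in a single pass (the seen array holds every distinct line, so memory use is comparable to the two-process version):
$ awk 'BEGIN{FS=SUBSEP="/"}!seen[$0]++{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file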
Upvotes: 2