Reputation: 1
I have a huge text file (~4.5 GB) that holds ~48 million lines. All lines are in the following syntax:
country01/city01/street01/building01
country01/city01/street01/building02
country01/city01/street02/building01
country01/city01/street02/building02
country01/city02/street01/building01
.
.
etc...
I'm trying to find a quick way to extract the street names and the number of buildings each street holds.
I tried various combinations of sed
and awk
together with wc -l
but it gets messy and I'm definitely missing something.
Any help would be appreciated!
Upvotes: 0
Views: 59
Reputation: 26571
If you just need to know the number of buildings on a street, you can do the following:
$ cut -d'/' -f-3 file | sort | uniq -c
This will give you a sorted list of streets with the building count next to each:
2 country01/city01/street01
2 country01/city01/street02
1 country01/city02/street01
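If you want the busiest streets first (assuming that ordering is useful to you), tack a numeric sort onto the end:
$ cut -d'/' -f-3 file | sort | uniq -c | sort -rn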
If there might be duplicate lines in your list, you can deduplicate first:
$ sort -u file | cut -d'/' -f-3 | uniq -c
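For example, with a duplicated building line (shortened hypothetical names for brevity), the duplicate is only counted once:
$ printf 'c01/ci01/st01/b01\nc01/ci01/st01/b01\n' | sort -u | cut -d'/' -f-3 | uniq -c
      1 c01/ci01/st01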
If you really have an enormous file that might not fit into memory and sort
takes too long, you can do the following instead:
$ awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file
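Here FS and SUBSEP are both set to /, so the array key $1,$2,$3 prints back as country/city/street. Since for (i in a) iterates in no particular order, you can pipe the (now much smaller) output through sort if you want it ordered by count:
$ awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file | sort -rn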
or if you might have duplicates:
$ awk '($0 in a){next}{print; a[$0]}' file | awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}'
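As a sketch of the same idea in a single pass (the seen array holds every distinct line, so memory use is comparable to the two-process version):
$ awk 'BEGIN{FS=SUBSEP="/"}!seen[$0]++{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file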
Upvotes: 2