p1nesap
p1nesap

Reputation: 179

Connecting Wget and Sed Commands in One Script?

I use 3 commands (wget/sed/and a tr/sort) that all work in command line to produce a most-common words list. I use commands sequentially, saving output from sed to use in the tr/sort command. Now I need to graduate to writing a script that combines these 3 commands. So, 1) wget downloads a file, that I put into 2) sed -e 's/<[^>]*>//g' wget-file.txt, and that output > goes to 3)

cat sed-output.txt | tr -cs A-Za-z\'  '\n' | tr A-Z  a-z | sort | uniq -c | 
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt

I'm aware of the problem/debate about using regex to remove HTML tags, but these 3 commands are working for me for the moment. So thanks in helping pull this together.

Upvotes: 0

Views: 2618

Answers (2)

BMW
BMW

Reputation: 45223

Using awk.

wget -O- http://down.load/file| awk '{ gsub(/<[^>]*>/,"")                # remove the content in label <>
       $0=tolower($0)                    # convert all to lowercase
       gsub(/[^a-z]]*/," ")              # remove all non-letter chars and replaced by space
       for (i=1;i<=NF;i++) a[$i]++       # save each word in array a, and sum it.
     }END{for (i in a) print a[i],i|"sort -nr|head -100"}'   # print the result, sort it, and get the top 100 records only

Upvotes: 2

chaos
chaos

Reputation: 9262

This command should do the job:

wget -O- http://down.load/file | sed -e 's/<[^>]*>//g' | \
tr -cs A-Za-z\'  '\n' | tr A-Z  a-z | sort | uniq -c | \
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt

Upvotes: 0

Related Questions