Causality

Reputation: 1123

In Linux shell or awk, how to replace the url in a line with its domain

For example, the input:

line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:password@www.facebook.com/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Should result in

line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Is it possible to achieve this with an awk one-liner (sub and regex?) If not, how would you implement it in bash?

Upvotes: 1

Views: 786

Answers (6)

ghoti

Reputation: 46846

My quick & dirty sed solution would be this:

sed 's#//[^@]*@#//#;s#\([^/]\)/[^/][^ ]* #\1 #' file1

Like the other answers here, it doesn't restrict its activity to just the third column. It relies on the idea that the first non-doubled slash in the URL is where you want to start stripping, and that those magic double slashes don't appear anywhere else on the line.

To restrict things to just the third column, awk seems like a good bet. Most awk implementations can't use backreferences in sub() or gsub(), but GAWK's gensub() can, like this:

gawk '{$3=gensub(/\/\/([^@\/]+@)?([^\/]+).*/, "//\\2", "g", $3)} 1' file1

This is similar to but simpler than jaypal's solution as it uses only a single substitution, and it doesn't require that "www" be part of the hostname.

But you can also do this in pure bash:

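while read -r one two three four; do
  # Only touch fields that look like URLs; "badColumn" passes through unchanged
  if [[ $three == *//* ]]; then
    method=${three%%//*}   # scheme plus colon, e.g. "http:"
    host=${three#*//}      # everything after the double slash
    host=${host%%/*}       # cut at the first slash to drop the whole path
    host=${host##*@}       # drop any user:password@ prefix
    three="$method//$host"
  fi
  echo "$one $two $three $four"
done < file1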

Yep. You can do anything in bash. It just takes more typing. :)
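
As a minimal sketch, the same parameter-expansion idea can be wrapped in a reusable function. The name strip_url is hypothetical, and the sketch assumes that a field is a URL only if it contains "://". It uses the longest-match forms (%% and ##) so that the whole path is dropped, and it sticks to POSIX expansions so it should also run under sh.

```shell
# strip_url: hypothetical helper, reduces scheme://[user:pass@]host/path
# to scheme://host and prints anything without "://" unchanged.
strip_url() {
  case $1 in
    *://*)
      scheme=${1%%://*}   # text before the first "://"
      host=${1#*://}      # text after the first "://"
      host=${host%%/*}    # cut at the first slash (drops the path)
      host=${host##*@}    # drop any user:password@ prefix
      printf '%s://%s\n' "$scheme" "$host"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

strip_url 'http://www.google.com/index.html'                      # http://www.google.com
strip_url 'https://user:password@www.facebook.com/pp/index.html'  # https://www.facebook.com
strip_url 'badColumn'                                             # badColumn
```

Cutting the path before stripping the credentials means an "@" that appears later in the path cannot be mistaken for the end of the user:password part.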

Upvotes: 0

fredtantini

Reputation: 16556

Not the most beautiful regexp, but in sed:

$ sed -r 's|://([^/]*@)?([^/]*)[^ \t]*|://\2|g' < myfile
line1 col1-1 http://www.google.com/ col8
line2 col1-2 https://user:[email protected]/ col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Upvotes: 1

Jotne

Reputation: 41456

Here is another awk solution:

awk '/http/ {split($3,a,"/");sub(/^.*@/,"",a[3]);$3=a[1]"//"a[3]}8' file
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Upvotes: 1

Steve

Reputation: 54392

I think it would probably be better to use a URL parser. For example, Python's urllib.parse module has urlparse, which splits a URL into its components. Here's some example code, run like:

python3 script.py file

Contents of script.py:

import sys
import csv
from urllib.parse import urlparse

with open(sys.argv[1], 'r') as csvfile:
    r = csv.reader(csvfile, delimiter=' ')
    for row in r:
        url = urlparse(row[2])
        if url.scheme and url.hostname:
            row[2] = url.scheme + "://" + url.hostname
        print(' '.join(row))

Results:

line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Upvotes: 5

jaypal singh

Reputation: 77095

With GNU awk you can do:

gawk '$3~/http/{$3=gensub(/([^/]+)\/\/([^/]+).*/,"\\1//\\2","g",$3);gsub(/\/\/.*www/,"//www",$3)}1' file

$ cat file
line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:password@www.facebook.com/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

$ awk '$3~/http/{$3=gensub(/([^/]+)\/\/([^/]+).*/,"\\1//\\2","g",$3);gsub(/\/\/.*www/,"//www",$3)}1' file
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:[email protected] col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8

Upvotes: 1

thom

Reputation: 2332

Replace "//user:password@" with "//":

sed 's:/.*@://:g' inputfile
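
A quick check on the question's second line suggests this substitution removes only the credentials and leaves the path in place. The sketch below shows that, then adds a second expression to cut the path as well. The function names strip_credentials and url_to_origin are hypothetical, and the second expression is an assumption of mine rather than part of the answer above. Both rely on the URL being the only place on the line where "//" or "@" appears.

```shell
# strip_credentials: the substitution from the answer, reading stdin.
strip_credentials() {
  sed 's:/.*@://:g'
}

# url_to_origin: hypothetical two-step variant.
#   1. drop "user:password@" after the double slash
#   2. keep "//host" and drop the path that follows it
url_to_origin() {
  sed -e 's:/.*@://:' -e 's:\(//[^/ ]*\)/[^ ]*:\1:'
}

line='line2 col1-2 https://user:password@www.facebook.com/pp/index.html col3-2'

printf '%s\n' "$line" | strip_credentials
# line2 col1-2 https://www.facebook.com/pp/index.html col3-2

printf '%s\n' "$line" | url_to_origin
# line2 col1-2 https://www.facebook.com col3-2
```

Lines without a URL, such as the badColumn row, match neither expression and pass through unchanged.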

Upvotes: 0
