Reputation: 1123
For example, the input:
line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:[email protected]/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
should result in:
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Is it possible to achieve this with an awk one-liner (using sub() and a regex)? Otherwise, how would you implement it in bash?
Upvotes: 1
Views: 786
Reputation: 46846
My quick & dirty sed solution would be this:
sed 's#//[^@]*@#//#;s#\([^/]\)/[^/][^ ]* #\1 #' file1
which, like the others here, doesn't restrict its activity to just the third column. It relies on the idea that the first non-doubled slash in the URL is where you want to start stripping, and that those magic double slashes don't appear anywhere else on the line.
To restrict things to just the third column, awk seems like a good bet. You can't use backreferences with the sub()
or gsub()
functions in most awk implementations, but you can use them in GAWK's gensub()
, like this:
gawk '{$3=gensub(/\/\/([^@\/]+@)?([^\/]+).*/, "//\\2", "g", $3)} 1' file1
This is similar to but simpler than jaypal's solution as it uses only a single substitution, and it doesn't require that "www" be part of the hostname.
But you can also do this in pure bash:
while read -r one two three rest; do
  if [[ $three == *//* ]]; then
    method=${three%%//*}
    host=${three#*//}
    host=${host#*@}
    host=${host%%/*}
    three="$method//$host"
  fi
  echo "$one $two $three $rest"
done < file1
Yep. You can do anything in bash. It just takes more typing. :)
Upvotes: 0
Reputation: 16556
Not the most beautiful regexp, but in sed:
$ sed -r 's|://([^/]*@)?([^/]*)[^ \t]*|://\2|g' < myfile
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Upvotes: 1
Reputation: 41456
Here is another awk
awk '/http/ {split($3,a,"/");sub(/^.*@/,"",a[3]);$3=a[1]"//"a[3]}8' file
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Upvotes: 1
Reputation: 54392
I think it would probably be better to use a URL parser. For example, Python has urlparse, which can be used to split URLs into components. Here's some example code; run it like:
python3 script.py file
Contents of script.py
:
import sys
import csv
from urllib.parse import urlparse
with open(sys.argv[1], 'r') as csvfile:
    r = csv.reader(csvfile, delimiter=' ')
    for row in r:
        url = urlparse(row[2])
        if url.scheme and url.hostname:
            row[2] = url.scheme + "://" + url.hostname
        print(' '.join(row))
Results:
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Upvotes: 5
Reputation: 77095
With GNU awk
you can do:
gawk '$3~/http/{$3=gensub(/([^/]+)\/\/([^/]+).*/,"\\1//\\2","g",$3);gsub(/\/\/.*www/,"//www",$3)}1' file
$ cat file
line1 col1-1 http://www.google.com/index.html col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://user:[email protected]/pp/index.html col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
$ gawk '$3~/http/{$3=gensub(/([^/]+)\/\/([^/]+).*/,"\\1//\\2","g",$3);gsub(/\/\/.*www/,"//www",$3)}1' file
line1 col1-1 http://www.google.com col3-1 col4 col5 col6 col7 col8
line2 col1-2 https://www.facebook.com col3-2 col4 col5 col6 col7 col8
line3 col1-3 badColumn col3-3 col4 col5 col6 col7 col8
Upvotes: 1