Reputation: 182
I have one file that contains 1,000,000 lines as shown in "Input" below. I want to trim the 3rd column, which contains a mix of domains, URLs, URNs, and URIs, down to just the domain name. I want to do this on Debian using bash.
Input:
k285N2YBqOyRFMJBdrP0,10,someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP1,10,http://www.someurl.ds:2331,This is bad url_feed
k285N2YBqOyRFMJBdrP2,10,https://someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP3,10,someurl.ds/dsadsa/http/test/com,This is bad url_feed
k285N2YBqOyRFMJBdrP4,10,a.b.c.d.someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP5,10,anythingelse.someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
Output:
k285N2YBqOyRFMJBdrP0,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP1,10,www.someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP2,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP3,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP4,10,a.b.c.d.someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP5,10,anythingelse.someurl.ds,This is bad url_feed
I am reducing the 3rd column as I want with:
cat test3.txt | cut -d"," -f3 | sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' | cut -d "/" -f1
How can I extract the 3rd column, modify it, and put it back in place?
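For illustration, one way to stitch the trimmed column back between the untouched columns is bash process substitution with paste; a rough sketch (outfile.txt is a made-up name, and this writes a new file rather than editing in place):

cut -d"," -f3 test3.txt \
  | sed -E 's_.*://([^/@]*@)?([^/:]+).*_\2_' \
  | cut -d"/" -f1 \
  | paste -d"," <(cut -d"," -f1,2 test3.txt) - <(cut -d"," -f4- test3.txt) \
  > outfile.txt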
Upvotes: 1
Views: 40
Reputation: 626748
You may use the following awk:
awk 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3); print; }' file > outfile
Here,

- BEGIN { OFS=FS="," } will set the input and output field separators to ,
- sub(/.*:\/\/([^\/@]*@)?/, "", $3) will remove the part of the Column 3 value at the start that you do not need
- sub(/[\/:].*/, "", $3) will remove the trailing part of the Column 3 value that you do not need.

Note that instead of the print command, you may use 1 after } (this is the same thing in the end, it prints the current record): 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3); }1'.
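To address the in-place part of the question: a minimal sketch, assuming GNU awk 4.1 or later for its -i inplace extension (Debian's default awk may be mawk, which does not support it); the portable fallback is a temporary file:

# GNU awk 4.1+ can rewrite the file in place (assumes gawk is installed)
gawk -i inplace 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3) } 1' file

# Portable alternative: write to a temporary file, then replace the original
awk 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3) } 1' file > file.tmp && mv file.tmp file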
See an online demo.
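For a quick local check, assuming the sample records are saved one per line in test3.txt as in the question:

awk 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3); print; }' test3.txt
# k285N2YBqOyRFMJBdrP0,10,someurl.ds,This is bad url_feed
# k285N2YBqOyRFMJBdrP1,10,www.someurl.ds,This is bad url_feed
# ... (matches the expected output shown in the question)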
Upvotes: 3