creed

Reputation: 182

Column and line based modification using bash

I have a file with 1,000,000 lines like the ones shown under "Input" below. The 3rd column holds a mix of domains, URLs, URNs, and URIs, and I want to trim each value down to just the domain name. I want to do this on Debian using bash.

Input:

k285N2YBqOyRFMJBdrP0,10,someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP1,10,http://www.someurl.ds:2331,This is bad url_feed
k285N2YBqOyRFMJBdrP2,10,https://someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP3,10,someurl.ds/dsadsa/http/test/com,This is bad url_feed
k285N2YBqOyRFMJBdrP4,10,a.b.c.d.someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP5,10,anythingelse.someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed

Output:

k285N2YBqOyRFMJBdrP0,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP1,10,www.someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP2,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP3,10,someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP4,10,a.b.c.d.someurl.ds,This is bad url_feed
k285N2YBqOyRFMJBdrP5,10,anythingelse.someurl.ds,This is bad url_feed

I can already reduce the 3rd column to what I want with:

cat test3.txt | cut -d"," -f3 | sed -E -e 's_.*://([^/@]*@)?([^/:]+).*_\2_' | cut -d "/" -f1

How can I extract the 3rd column, modify it, and write it back in place, keeping the other columns unchanged?

Upvotes: 1

Views: 40

Answers (1)

Wiktor Stribiżew

Reputation: 626748

You may use the following awk command:

awk 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3); print; }' file > outfile

Here,

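  • BEGIN { OFS=FS="," } sets both the input field separator (FS) and the output field separator (OFS) to a comma, so the record is rebuilt with commas once Column 3 is modified
  • sub(/.*:\/\/([^\/@]*@)?/, "", $3) removes the leading part of the Column 3 value that you do not need: everything up to and including ://, plus an optional user@ part
  • sub(/[\/:].*/, "", $3) removes the trailing part of the Column 3 value that you do not need: everything from the first / or : onward (the path or the port)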

Note that instead of print command, you may use 1 after } (this is the same thing in the end, it prints the current record): 'BEGIN { OFS=FS="," } { sub(/.*:\/\/([^\/@]*@)?/, "", $3); sub(/[\/:].*/, "", $3); }1'.
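
To see what the two sub() calls do, here is a minimal sketch that runs the same awk program on three of the sample lines from the question. The lines are fed through a here-document rather than a file, and the variable name result is only an assumption made for this illustration.

```shell
# Run the awk program from the answer on three sample records.
# The here-document stands in for the real input file.
result=$(awk 'BEGIN { OFS=FS="," }
{
  sub(/.*:\/\/([^\/@]*@)?/, "", $3)   # strip scheme and optional user@
  sub(/[\/:].*/, "", $3)              # strip port and path
  print
}' <<'EOF'
k285N2YBqOyRFMJBdrP1,10,http://www.someurl.ds:2331,This is bad url_feed
k285N2YBqOyRFMJBdrP2,10,https://someurl.ds/dsadsa/dsadsads.exe/,This is bad url_feed
k285N2YBqOyRFMJBdrP3,10,someurl.ds/dsadsa/http/test/com,This is bad url_feed
EOF
)

printf '%s\n' "$result"
# k285N2YBqOyRFMJBdrP1,10,www.someurl.ds,This is bad url_feed
# k285N2YBqOyRFMJBdrP2,10,someurl.ds,This is bad url_feed
# k285N2YBqOyRFMJBdrP3,10,someurl.ds,This is bad url_feed
```

Since awk does not edit files in place by default, one common way to overwrite the original is to write to a temporary file and rename it afterwards, along the lines of awk '...' file > file.tmp && mv file.tmp file (file.tmp being a hypothetical name).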


Upvotes: 3
