MKorona

Reputation: 295

Extract domains from one column while keeping other columns

I have a file with three columns that looks like this:

0       1612291061      http://www.staropolska.pl/
0       1612450417      http://m.kerygma.pl/
6831926761338023936     1612171787      http://www.kerygma.pl/hermeneutyka-biblijna/377-ksiegi-starego-testamentu-mini-streszczenie
6867871457052077056     1612534199      http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565

I want to extract the domains from the third column while keeping the first two columns, so that the resulting file looks like this:

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

So far I am able to extract domains using grep:

cat file.txt | grep -Eo '(http|https)://[^/"]+'

but this gives me only the domains from the third column:

http://www.staropolska.pl
http://m.kerygma.pl
http://www.kerygma.pl
http://www.kerygma.pl

without printing the first two.

Upvotes: 11

Views: 310

Answers (4)

The fourth bird

Reputation: 163632

Another option is gawk with gensub, referring to capture group 1 as \\1 in the replacement:

gawk '{
  print gensub(/(https?:\/\/[^/"]+).*/, "\\1", "g", $0);  
}
' file

Output

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl
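Note that gensub is a gawk extension, so the command above will not run under other awks such as mawk. A minimal sketch of the same idea using only match() and substr() might look like the following; keep_domain is a hypothetical helper name, and passing lines without a URL through unchanged is my assumption, not something the question asks for.

```shell
# Sketch: same result without gensub, for awks other than gawk.
# keep_domain is a hypothetical helper name; it reads lines on stdin.
keep_domain() {
  awk '{
    # Locate scheme and host; match() sets RSTART and RLENGTH.
    if (match($0, /https?:\/\/[^\/"]+/))
      print substr($0, 1, RSTART + RLENGTH - 1)  # line start through end of host
    else
      print                                      # no URL: leave the line as it is
  }'
}

printf '0\t1612450417\thttp://m.kerygma.pl/\n' | keep_domain
```

To process a file, redirect it into the function, for example keep_domain < file.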

Upvotes: 2

RavinderSingh13

Reputation: 133770

With the samples shown, you could try the following awk command.

awk 'match($0,/.*http[s]?:\/\/[^/]*/){print substr($0,RSTART,RLENGTH)}' Input_file

Explanation of the command above:

awk '                                ## Start of the awk program.
match($0,/.*http[s]?:\/\/[^/]*/){    ## Match from the start of the line through http:// or https:// and up to the next /.
  print substr($0,RSTART,RLENGTH)    ## Print the matched part of the line.
}
' Input_file                         ## Name of the input file.
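To see what match() actually records, a small sketch like this prints RSTART and RLENGTH before the substring they select. show_match is a hypothetical helper name and the input line is made up for illustration; the slash inside the bracket expression is escaped here only as a precaution for awks that are strict about / in regex literals.

```shell
# Print the values match() stored, then the substring they select.
# show_match is a hypothetical helper name; input arrives on stdin.
show_match() {
  awk 'match($0, /.*http[s]?:\/\/[^\/]*/) {
    print "RSTART=" RSTART, "RLENGTH=" RLENGTH
    print substr($0, RSTART, RLENGTH)
  }'
}

printf 'a b http://x.y/z\n' | show_match
```

This should print RSTART=1 RLENGTH=14 followed by a b http://x.y. Because the regex begins with .*, the match always starts at column 1, which is why the leading columns are kept.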

Upvotes: 6

Christian Fritz

Reputation: 21384

Another option is cut with / as the delimiter. The first three /-separated fields cover everything up to the end of the domain:

$ cat file.txt | cut -d '/' -f 1-3
0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl
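The cut approach depends on the first two columns containing no / at all, which holds for the sample data. A quick sketch of both the normal case and that caveat follows; first_three_slash_fields is a hypothetical helper name and the second input line is invented to show the failure mode.

```shell
# first_three_slash_fields is a hypothetical helper; it reads lines on stdin.
first_three_slash_fields() {
  cut -d '/' -f 1-3
}

# The fields are "<columns> http:", "", "<host>", "<path>..."; keeping 1-3 drops the path.
printf '0\t1612291061\thttp://www.staropolska.pl/\n' | first_three_slash_fields

# Caveat: a slash in an earlier column shifts the fields and truncates the URL.
printf 'a/b\t1612291061\thttp://www.staropolska.pl/\n' | first_three_slash_fields
```

The second call ends in http:/ with no host, so this variant is only safe when the URL is the first place a slash can appear.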

Upvotes: 7

anubhava

Reputation: 786329

You just need to let the grep regex match everything before https?:// as well:

grep -Eo '.*[[:blank:]]https?://[^/"]+' file

0       1612291061      http://www.staropolska.pl
0       1612450417      http://m.kerygma.pl
6831926761338023936     1612171787      http://www.kerygma.pl
6867871457052077056     1612534199      http://www.kerygma.pl

RegEx Explained:

  • .*: Match 0 or more of any characters
  • [[:blank:]]: Match one space or tab character
  • https?: Match https or http
  • ://: Match ://
  • [^/"]+: Match 1+ of any character that is not a / and not a "

Alternatively, you can use sed:

sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' file
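One difference between the two commands is worth knowing: grep -o prints only the matched text, so lines without a URL disappear, while sed prints lines it cannot substitute unchanged. A small sketch to illustrate, where with_grep and with_sed are hypothetical wrappers and the second input line is made up:

```shell
# with_grep and with_sed are hypothetical wrappers around the two commands above.
with_grep() { grep -Eo '.*[[:blank:]]https?://[^/"]+'; }
with_sed()  { sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~'; }

sample=$(printf '0\t1612450417\thttp://m.kerygma.pl/\nno url on this line\n')

printf '%s\n' "$sample" | with_grep   # prints only the line that has a URL
printf '%s\n' "$sample" | with_sed    # prints both lines, the second one untouched
```

Which behaviour is preferable depends on whether rows without a URL should be kept in the output file.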

Upvotes: 6
