Reputation: 295
I have a file with three columns that looks like this:
0 1612291061 http://www.staropolska.pl/
0 1612450417 http://m.kerygma.pl/
6831926761338023936 1612171787 http://www.kerygma.pl/hermeneutyka-biblijna/377-ksiegi-starego-testamentu-mini-streszczenie
6867871457052077056 1612534199 http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565
I want to extract the domains from the third column while keeping the first two columns, so the resulting file should look like this:
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
So far I am able to extract domains using grep:
cat file.txt | grep -Eo '(http|https)://[^/"]+'
but this gives me only the domains from the third column:
http://www.staropolska.pl
http://m.kerygma.pl
http://www.kerygma.pl
http://www.kerygma.pl
without printing the first two.
Upvotes: 11
Views: 310
Reputation: 163632
Another option is gawk with gensub, using the capture group \\1 in the replacement:
gawk '{
print gensub(/(https?:\/\/[^/"]+).*/, "\\1", "g", $0);
}
' file
Output
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
Upvotes: 2
Reputation: 133770
With your shown samples, could you please try the following in awk:
awk 'match($0,/.*http[s]?:\/\/[^/]*/){print substr($0,RSTART,RLENGTH)}' Input_file
Explanation: Adding a detailed explanation for the above.
awk ' ##Start the awk program.
match($0,/.*http[s]?:\/\/[^/]*/){ ##Use match to find everything from the start of the line through http:// or https://, up to (but not including) the next /.
print substr($0,RSTART,RLENGTH) ##Print the matched substring.
}
' Input_file ##Name of the input file.
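Note that the command above prints only lines where match succeeds, so a line without a URL would be dropped. If such lines should pass through unchanged, a minimal sketch could look like the following. The function name keep_all and the sample lines are made up for illustration, and the / inside the bracket expression is escaped for awk implementations that are stricter about it:

```shell
# Hypothetical helper: trim each line after the host part of the URL,
# but print lines without a URL as they are.
keep_all() {
  awk 'match($0, /.*https?:\/\/[^\/]*/) { $0 = substr($0, RSTART, RLENGTH) } 1'
}

# Made-up sample input: one line without a URL, one with a path.
printf '%s\n' \
  'no url here' \
  '1 2 http://a.b/c' |
  keep_all
```

The trailing 1 is an always-true pattern, so every line is printed, whether or not it was rewritten.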
Upvotes: 6
Reputation: 21384
Another option is cut, using / as the delimiter:
$ cat file.txt | cut -d '/' -f 1-3
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
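Keep in mind that cut treats every / on the line as a delimiter, so this relies on the first two columns never containing one. If that is a concern, a rough sketch that touches only the third field might look like this. The function name extract_domains is hypothetical, and it assumes whitespace-separated columns with the URL always in column three:

```shell
# Hypothetical helper: split only the third field on "/" and rebuild it
# as scheme + "//" + host. For "http://host/path", parts[1] is "http:",
# parts[2] is empty and parts[3] is the host.
# Assigning to $3 makes awk rejoin the fields with a single space.
extract_domains() {
  awk '{
    n = split($3, parts, "/")
    if (n >= 3) $3 = parts[1] "//" parts[3]
    print
  }'
}

# Two lines taken from the sample in the question.
printf '%s\n' \
  '0 1612291061 http://www.staropolska.pl/' \
  '6867871457052077056 1612534199 http://www.kerygma.pl/katechizm-kkk/kkk-iv-modlitwa/538-kkk-2558-2565' |
  extract_domains
```

Because the fields are rejoined with a single space, tab-separated input would come out space-separated unless OFS is set accordingly.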
Upvotes: 7
Reputation: 786329
You just need to allow the grep regex to match anything before https?://:
grep -Eo '.*[[:blank:]]https?://[^/"]+' file
0 1612291061 http://www.staropolska.pl
0 1612450417 http://m.kerygma.pl
6831926761338023936 1612171787 http://www.kerygma.pl
6867871457052077056 1612534199 http://www.kerygma.pl
RegEx Explained:
.*: Match 0 or more of any character
[[:blank:]]: Match one space or tab character
https?: Match https or http
://: Match ://
[^/"]+: Match 1+ of any character that is not a / and not a "
Alternatively, you may try this sed as well:
sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~' file
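To check how the pattern behaves with https and with a URL that has no path at all, a quick test along these lines may help. The function name strip_path and the example.org lines are made up for illustration:

```shell
# Hypothetical wrapper around the sed command above; reads from stdin.
strip_path() {
  sed -E 's~([[:blank:]]https?://[^/"]+).*~\1~'
}

# Made-up input: an https URL with a path and query, and a bare http URL.
printf '%s\n' \
  '1 100 https://example.org/a/b?c=d' \
  '2 200 http://example.org' |
  strip_path
```

The first line should lose everything after the host, while the second should come through unchanged, since .* is allowed to match nothing.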
Upvotes: 6