Reputation: 27
I need some help: I'm looking for a way to remove everything after the nth occurrence (most likely the 4th or 5th) of "/" in a hyperlink, using a command like this:
cat text.txt | grep -o "^((?:[^/]*/){5}).*$"
This command is not working for me. For example, if I have
https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=
My desired output is:
https://www.forbes.com/forbes/welcome/
Additionally, if a link has fewer than 4 / characters, I'd like to keep everything.
Upvotes: 1
Views: 349
Reputation: 3845
If the question mark (?) can serve as the cut-off point, you could try:
cut -d '?' -f1 input_file
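A quick check of this approach on the sample link from the question. The helper name strip_query is made up for illustration. Lines without a ? should pass through unchanged, since cut prints lines that contain no delimiter as-is unless -s is given.

```shell
# strip_query is a hypothetical helper name: keep the text before the first '?'
strip_query() {
  cut -d '?' -f1
}

printf '%s\n' 'https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=' | strip_query
# prints: https://www.forbes.com/forbes/welcome/
```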
Upvotes: 0
Reputation: 2865
awk 'NF<_||NF=_' FS=/ OFS=/ _=5
https://www.forbes.com/forbes/welcome
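The one-liner is terse, so here is a spelled-out sketch of what it appears to do: split each line on /, and if there are more than n fields, truncate to n. The function name keep_fields and the variable n are assumptions for illustration. It relies on awk rebuilding the line when NF is assigned, which common implementations such as gawk and mawk do. Note that the fifth slash itself is dropped, so the result has no trailing /.

```shell
# keep_fields is a hypothetical helper; $1 is the number of "/"-separated
# fields to keep (defaults to 5)
keep_fields() {
  awk -v n="${1:-5}" 'BEGIN { FS = OFS = "/" }
    NF > n { NF = n }   # truncate long lines; the line is rebuilt using OFS
    { print }           # lines with n or fewer fields are printed unchanged
  '
}

printf '%s\n' 'https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=' | keep_fields 5
# prints: https://www.forbes.com/forbes/welcome
```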
Upvotes: 1
Reputation: 163467
You can match the protocol and, if it is available, use grep -P with a non-capturing group that matches up to and including a / and is repeated 3 times after it:
grep -oP "^https?://(?:[^/]*/){3}" text.txt
Or use grep -E, repeating a capture group:
grep -oE "^https?://([^/]*/){3}" text.txt
Or just use grep -o with the right escapes:
grep -o "^https\?://\([^/]*/\)\{3\}" text.txt
Example
echo "https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=" | grep -oP "^https?://(?:[^/]*/){3}"
Output
https://www.forbes.com/forbes/welcome/
Note that you don't have to use cat text.txt | because grep can read the file directly.
Upvotes: 3
Reputation: 785611
You can use this grep, which would work in any version of grep:
grep -oE '([^/]*/){5}' file
https://www.forbes.com/forbes/welcome/
Similarly, this sed would also work:
sed -E 's~(([^/]*/){5}).*~\1~' file
https://www.forbes.com/forbes/welcome/
Both of these solutions grab the first 5 tokens delimited by /.
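One difference matters for the "keep everything" requirement in the question: grep -o prints only matches, so a link with fewer than 5 slashes produces no output, whereas sed leaves a non-matching line untouched. A small sketch, where first_five is a made-up wrapper name and https://example.com/a is an invented short link:

```shell
# first_five is a hypothetical wrapper around the sed command above
first_five() {
  sed -E 's~(([^/]*/){5}).*~\1~'
}

printf '%s\n' \
  'https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=' \
  'https://example.com/a' | first_five
# prints:
# https://www.forbes.com/forbes/welcome/
# https://example.com/a
```

Running the same two lines through grep -oE '([^/]*/){5}' prints only the first result.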
Upvotes: 2
Reputation: 133640
1st solution: With awk, please try the following. It should cover both cases, where either /? or ? appears in the URL (which could happen in an actual request). It simply sets the field separator to /?\\? for all lines of your Input_file and prints the 1st field of each line that starts with either http:// or https://.
awk -F'/?\\?' '/^https?:\/\//{print $1}' Input_file
2nd solution: With GNU awk, using its match function, please try the following. It is a little more complex than the first solution, but it can help if you need to check more values besides the part before the ?, since it saves the matched values into an array.
awk 'match($0,/^(https?:\/\/([^?]*))\?/,arr1){print arr1[1]}' Input_file
Upvotes: 5
Reputation: 11237
Assuming the question mark (?) can serve as the cut-off point, you can try this sed:
$ sed 's/?.*//' input_file
https://www.forbes.com/forbes/welcome/
Upvotes: 3