Reputation: 27

How do I delete everything after the 3rd or 4th occurrence of a character using sed/grep/regex

I need some help: I'm looking for a way to remove everything after the nth occurrence (most likely the 4th or 5th) of "/" in a hyperlink, using a command like this:

cat text.txt | grep -o "^((?:[^/]*/){5}).*$"

This command is not working for me. For example, if I have

https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=

My desired output is:

https://www.forbes.com/forbes/welcome/

Additionally, if a link has fewer than 4 "/" characters, I'd like to keep it unchanged.

Upvotes: 1

Answers (6)

ramsay

Reputation: 3845

If it is enough to cut everything from the first ? (question mark) onward, you could try:

cut -d '?' -f1 input_file

Upvotes: 0

RARE Kpop Manifesto

Reputation: 2865

awk 'NF<_||NF=_' FS=/ OFS=/ _=5

   https://www.forbes.com/forbes/welcome
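The one-liner above is terse, so here is a longer sketch of what I understand it to do: split on /, and when a line has more than 5 fields, assign NF to drop the extras. The helper name trim_fields and the variable n are made up for illustration, and the second sample URL is invented. It assumes an awk that rebuilds the record when NF is lowered (gawk and mawk do, as far as I know; other awks are worth checking).

```shell
# Hypothetical helper (name is illustrative): keep at most n "/"-separated fields.
trim_fields() {
  awk -F/ -v OFS=/ -v n="$1" '
    NF > n { NF = n }   # lowering NF drops the extra fields and rebuilds $0 with OFS
    { print }           # lines with n or fewer fields pass through unchanged
  '
}

# Made-up sample input: one long URL and one with only three slashes.
printf '%s\n' \
  'https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=' \
  'https://example.com/a' | trim_fields 5
```

Because the fifth field is "welcome" and the separator after it is discarded, the trailing / is not kept; short lines are printed as they are, which matches the "keep everything" requirement in the question.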

Upvotes: 1

The fourth bird

Reputation: 163467

You can match the protocol and then, if grep -P is available, repeat a non-capturing group 3 times, where each repetition matches up to and including the next /:

grep -oP "^https?://(?:[^/]*/){3}" text.txt

Or grep -E repeating a capture group:

grep -oE "^https?://([^/]*/){3}" text.txt

Or plain grep -o (basic regular expressions) with the necessary escapes; note that \? is a GNU extension in basic regular expressions:

grep -o "^https\?://\([^/]*/\)\{3\}" text.txt

Example

echo "https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=" | grep -oP "^https?://(?:[^/]*/){3}"

Output

https://www.forbes.com/forbes/welcome/

Note that you don't need cat text.txt | here; grep can read the file directly.
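To see why the command from the question prints nothing, here is a small sketch. Without -E or -P, grep uses basic regular expressions, where unescaped ( ) { } and ? are ordinary characters, so the pattern looks for a literal "((?:" at the start of the line. Even with -P, the trailing .*$ would match the rest of the line, so -o would print the whole URL. The variable name url is just for illustration.

```shell
url='https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer='

# Pattern from the question, interpreted as a basic regular expression:
# the parentheses, braces and ? are literal here, so nothing matches.
printf '%s\n' "$url" | grep -o "^((?:[^/]*/){5}).*$" || echo "no match"

# Extended syntax, anchored on the protocol, with no trailing .* so that
# -o prints only the wanted prefix.
printf '%s\n' "$url" | grep -oE "^https?://([^/]*/){3}"
```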

Upvotes: 3

anubhava

Reputation: 785611

You can use this grep, which should work with most grep implementations (it needs -o and -E, which both GNU and BSD grep provide):

grep -oE '([^/]*/){5}' file

https://www.forbes.com/forbes/welcome/

Similarly, this sed command also works:

sed -E 's~(([^/]*/){5}).*~\1~' file

https://www.forbes.com/forbes/welcome/

Both solutions keep the first 5 tokens, where each token ends with a /.
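The two commands differ for links with fewer than 5 slashes, which matters for the "keep everything" requirement in the question. A quick sketch, using a made-up second URL and illustrative variable names:

```shell
# Sample input (second URL is invented): one long URL, one with only three slashes.
urls='https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=
https://example.com/a'

# grep -o prints only matched text, so the short URL produces no output.
grep_out=$(printf '%s\n' "$urls" | grep -oE '([^/]*/){5}')

# sed prints every line and substitutes only where the pattern matches,
# so the short URL passes through unchanged.
sed_out=$(printf '%s\n' "$urls" | sed -E 's~(([^/]*/){5}).*~\1~')

printf 'grep:\n%s\n\nsed:\n%s\n' "$grep_out" "$sed_out"
```

So if short links must survive, the sed form looks like the safer choice of the two.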

Upvotes: 2

RavinderSingh13

Reputation: 133640

1st solution: With awk, please try the following. It should cover both cases, where the URL contains /? or just ? (either could appear in an actual request). It sets the field separator to /?\\? (an optional / followed by a literal ?) for all lines of your Input_file and prints the 1st field of each line that starts with http:// or https://.

awk -F'/?\\?' '/^https?:\/\//{print $1}' Input_file

2nd solution: With GNU awk, using its match function, please try the following. It is a little more complex than the first solution, but it can help if you need to check more values besides the part before the ?, since it saves the captured values into an array.

awk 'match($0,/^(https?:\/\/([^?]*))\?/,arr1){print arr1[1]}' Input_file
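As a rough illustration of the first solution's behaviour, here is a sketch that wraps the same idea in a helper. The name strip_query is hypothetical, the non-forbes sample lines are invented, and I have written the separator as /?[?] (a bracket expression for the literal ?) only to sidestep the backslash escaping; it is meant to be equivalent, not a replacement.

```shell
# Hypothetical helper: print the part before an optional "/" plus "?",
# but only for lines that start with http:// or https://.
strip_query() {
  awk -F'/?[?]' '/^https?:\/\//{print $1}'
}

printf '%s\n' \
  'https://www.forbes.com/forbes/welcome/?toURL=https://forbes.com/&refURL=&referrer=' \
  'https://example.com/a' \
  'ftp://example.com/?x=1' | strip_query
```

Two things to be aware of: the / directly before the ? is part of the separator, so the result has no trailing slash, and lines that do not start with http:// or https:// are skipped rather than printed unchanged.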

Upvotes: 5

sseLtaH

Reputation: 11237

Assuming everything from the first ? (question mark) onward can be removed, you can try this sed:

$ sed 's/?.*//' input_file
https://www.forbes.com/forbes/welcome/

Upvotes: 3
