Grep a regex pattern from file which starts with certain pattern

Question

I am trying to build a shell script that reads a file (scope.txt) using a while loop. The scope file contains website domains. The loop iterates over scope.txt and searches for each domain in another file named urls.txt. I need to grep for the pattern in urls.txt and get the result shown at the end.

The scope file contains -

google.com
facebook.com

The URLs file contents -

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com

The output I need -

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

This is because the output should contain only the URLs whose host is a domain listed in scope.txt or a subdomain of one.

I tried to write a shell script, but I am not getting the desired output. The shell script contents -

while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt

anubhava · Accepted Answer

You may use this grep + sed solution:

grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

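The sed command builds the regular expressions that grep reads as its pattern list. It escapes every dot in each domain, then prefixes the line with a start anchor, the scheme, and an optional run of subdomain labels: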

sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt

^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
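
If you would rather keep the while loop from the question, the same idea can be sketched per domain. The attempt in the question seems to miss subdomain URLs because its pattern requires the domain to follow the scheme directly. This is only a minimal sketch: the function name filter_in_scope is hypothetical, and unlike the one-liner above it also excludes "/" from the subdomain label class, which is my own adjustment so that a dot in the path cannot be read as part of the host.

```shell
# Hypothetical helper (not part of the answer above): print the URLs in $2
# whose host is a domain listed in $1, or any subdomain of one.
filter_in_scope() {
    scope_file=$1
    urls_file=$2
    while IFS= read -r domain; do
        # Skip blank lines so an empty domain does not match every URL.
        [ -n "$domain" ] || continue
        # Escape dots so "google.com" cannot match e.g. "googleXcom".
        escaped=$(printf '%s\n' "$domain" | sed 's/\./\\./g')
        # Anchor at the scheme, allow zero or more "label." prefixes,
        # then require the escaped domain.
        # Assumption: "/" is excluded from labels (an addition to the answer's regex).
        grep -E "^https?://([^./?]+\.)*${escaped}" "$urls_file"
    done < "$scope_file"
}
```

Call it as: filter_in_scope scope.txt urls.txt. Note that the results come out grouped by domain in scope.txt order rather than in urls.txt order, and that grep runs once per domain, so the single grep -Ef call above is likely the better choice for large scope files.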
