Akshay Sharma

Reputation: 65

Grep a regex pattern from file which starts with certain pattern

I am trying to build a shell script that reads a file (scope.txt) in a while loop. The scope file contains website domains. The loop iterates over scope.txt and searches for each domain in another file named urls.txt. I need to grep urls.txt for each domain and get the result shown at the end of this question.

The scope file contains:

google.com
facebook.com

The URLs file contains:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com

The output I need:

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

That is, the output should contain every URL whose host is a domain listed in scope.txt or a subdomain of one.

I tried to write a shell script, but I am not getting the desired output. The script contents:

while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt

Upvotes: 2

Views: 205

Answers (2)

RavinderSingh13

Reputation: 133458

With your shown samples, please try the following.

awk '
FNR==NR{
  arr[$0]
  next
}
{
  for(key in arr){
    if($0~/^https?:\/\// && $0 ~ key"/"){
      print
      next
    }
  }
}
' scope urlfile

Explanation: the same program with a comment on each step.

awk '                  ##Start of the awk program.
FNR==NR{               ##True only while reading the first file (scope).
  arr[$0]              ##Store the current line (a domain) as an index of array arr.
  next                 ##Skip the remaining statements and read the next line.
}
{
  for(key in arr){     ##Loop over every domain stored in arr.
    if($0~/^https?:\/\// && $0 ~ key"/"){  ##If the line starts with http:// or https:// AND contains key followed by "/", do the following.
      print            ##Print the current line.
      next             ##Skip the remaining statements and read the next line.
    }
  }
}
' scope urlfile        ##Input files: the scope list first, then the URL list.
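
Note that the `$0 ~ key"/"` test treats each scope line as a regex and can match anywhere in the line, so a URL such as https://test.com/?url=google.com/ would also be printed. A minimal sketch of a stricter variant follows, assuming the goal is to compare only the host part of each URL. The function name `urls_in_scope` and the array name `scope` are hypothetical, and the sketch does not handle URLs with a `user@` prefix before the host.

```shell
# Hypothetical helper: print URLs whose host equals an in-scope domain
# or ends with "." followed by that domain (i.e. is a subdomain of it).
# Usage: urls_in_scope scope.txt urls.txt
urls_in_scope() {
  awk '
    FNR == NR {                    # first file: collect non-empty scope lines
      if ($0 != "") scope[$0]
      next
    }
    {
      # Require a scheme and take everything up to the first / ? # or :
      if (!match($0, /^https?:\/\/[^\/?#:]+/)) next
      host = substr($0, RSTART, RLENGTH)
      sub(/^https?:\/\//, "", host)          # strip the scheme, keep the host
      for (d in scope) {
        n = length(host) - length(d)
        # Literal comparison, so dots in the domain are not regex wildcards.
        if (host == d || (n > 0 && substr(host, n) == "." d)) {
          print
          next
        }
      }
    }
  ' "$1" "$2"
}
```

Because the comparison is a plain string suffix check, no escaping of the scope entries is needed.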

Upvotes: 3

anubhava

Reputation: 784998

You may use this grep + sed solution:

grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt

https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf

The sed command builds the regexes that grep uses. Its output is:

sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt

^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
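
The generated patterns stop right after the domain, so a host such as google.com.evil.test would still match. A possible hardening, not part of the answer above, is to also require that the host ends at that point. The sketch below assumes that is wanted; the function name `filter_in_scope` and the trailing `([/:?#]|$)` group are my additions. It passes the patterns through a variable rather than process substitution, so it should also run in a plain POSIX sh.

```shell
# Hypothetical helper: print URLs whose host is an in-scope domain
# or one of its subdomains.
# Usage: filter_in_scope scope.txt urls.txt
filter_in_scope() {
  scope_file=$1
  urls_file=$2
  # Build one ERE per scope line:
  #   1. drop empty lines
  #   2. escape the dots in the domain
  #   3. prefix the scheme and any number of subdomain labels
  #   4. require the host to end at / : ? # or end of line
  patterns=$(sed -e '/^$/d' \
                 -e 's/\./\\&/g' \
                 -e 's~^~^https?://([^./?#]+\\.)*~' \
                 -e 's~$~([/:?#]|$)~' "$scope_file")
  # A newline-separated pattern list is treated as alternatives by grep.
  grep -E -e "$patterns" "$urls_file"
}
```

With the sample files from the question, `filter_in_scope scope.txt urls.txt` should print the same three URLs as the one-liner above.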

Upvotes: 4
