Reputation: 65
I am trying to build a shell script that reads a file (scope.txt) in a while loop. The scope file contains website domains. The loop should iterate over scope.txt and search for each domain in another file named urls.txt. I need to grep for each pattern in urls.txt and get the result shown at the end.
The scope file contains:
google.com
facebook.com
The URLs file contains:
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://test.com/sdvs?url=google.com
https://abcd.com/jhhhh/hghv?proxy=https://google.com
https://a.b.c.d.facebook.com/ss/sdfsdf
http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com
The output I need:
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf
This is because the output should contain only the URLs whose host is a domain listed in scope.txt or a subdomain of one.
I tried to write a shell script, but I am not getting the desired output. The shell script contents:
while read -r line; do
cat urls.txt | grep -e "^https\:\/\/$line\|^http\:\/\/$line"
done < scope.txt
Upvotes: 2
Views: 205
Reputation: 133458
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$0]
next
}
{
for(key in arr){
if($0~/^https?:\/\// && $0 ~ key"/"){
print
next
}
}
}
' scope urlfile
Explanation: a detailed, line-by-line explanation of the above.
awk ' ##Start the awk program.
FNR==NR{ ##This condition is TRUE only while the first file (scope) is being read.
arr[$0] ##Create an entry in array arr indexed by the current line.
next ##Skip the remaining statements and move on to the next line.
}
{
for(key in arr){ ##Loop over every index stored in arr.
if($0~/^https?:\/\// && $0 ~ key"/"){ ##If the line starts with http:// or https:// AND contains key followed by /, do the following.
print ##Print the current line.
next ##Skip the remaining statements and move on to the next line.
}
}
}
' scope urlfile ##Input file names (scope.txt and urls.txt in the question).
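Note that the awk program above uses each scope entry as a regex and looks for it anywhere in the line, so a URL such as https://test.com/google.com/ would also be printed. If that matters, one stricter option is to compare only the host part of each URL. The following is a minimal sketch and not part of the original answer: it assumes URLs have no user:password@ prefix and that scope.txt holds one bare domain per line. The temporary files and the names workdir, result, host and suffix are made up for the demo.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' google.com facebook.com > "$workdir/scope.txt"
printf '%s\n' \
  'https://google.com/ukhkj/sdgdsdd/' \
  'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
  'https://test.com/sdvs?url=google.com' \
  'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
  'https://a.b.c.d.facebook.com/ss/sdfsdf' \
  'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
  > "$workdir/urls.txt"

result=$(awk '
FNR == NR { if ($0 != "") scope[$0]; next }   # first file: remember each scope domain
{
    host = $0
    if (sub("^https?://", "", host) == 0) next   # ignore lines without an http(s) scheme
    sub("[/?#:].*$", "", host)                   # cut off port, path, query and fragment
    for (d in scope) {
        suffix = "." d
        # keep the URL if the host equals the domain or ends in ".domain"
        if (host == d || (length(host) > length(suffix) &&
            substr(host, length(host) - length(suffix) + 1) == suffix)) {
            print
            next
        }
    }
}' "$workdir/scope.txt" "$workdir/urls.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the comparison is a plain string test, the dots in the domains are not treated as regex wildcards.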
Upvotes: 3
Reputation: 784998
You may use this grep + sed solution:
grep -Ef <(sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt) urls.txt
https://google.com/ukhkj/sdgdsdd/
http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf
https://a.b.c.d.facebook.com/ss/sdfsdf
The sed command builds the regexes that grep then uses:
sed 's/\./\\&/g; s~^~^https?://([^.?]+\\.)*~' scope.txt
^https?://([^.?]+\.)*google\.com
^https?://([^.?]+\.)*facebook\.com
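The same per-domain regex idea also fits the while loop from the question. The sketch below is a variant under a few assumptions rather than the exact command above: it additionally requires the domain to be followed by /, :, ?, # or the end of the line, and it keeps subdomain labels from crossing a /, so a URL like https://google.com.example.org/ is not reported. It assumes scope entries contain no regex metacharacters other than dots. The sample files are recreated in a temp directory, and workdir, domain, escaped and matches are illustrative names.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' google.com facebook.com > "$workdir/scope.txt"
printf '%s\n' \
  'https://google.com/ukhkj/sdgdsdd/' \
  'http://abcs.google.com/sdf/sg/dfg?ijkl=asffdf' \
  'https://test.com/sdvs?url=google.com' \
  'https://abcd.com/jhhhh/hghv?proxy=https://google.com' \
  'https://a.b.c.d.facebook.com/ss/sdfsdf' \
  'http://aa.b.c.d.com/dfgdfg/sgfdfg?url=https://google.com' \
  > "$workdir/urls.txt"

matches=$(
  while IFS= read -r domain; do
    [ -n "$domain" ] || continue                 # skip blank scope lines
    # Escape the dots so they match literally inside the regex.
    escaped=$(printf '%s\n' "$domain" | sed 's/\./\\./g')
    # scheme, optional subdomain labels, the domain, then a host boundary
    grep -E "^https?://([^./?#]+\.)*${escaped}([/:?#]|\$)" "$workdir/urls.txt"
  done < "$workdir/scope.txt"
)

printf '%s\n' "$matches"
rm -rf "$workdir"
```

Unlike the single grep -Ef call, this loop prints results grouped by scope entry rather than in urls.txt order, and it scans urls.txt once per domain.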
Upvotes: 4