Reputation: 25
I have .html files in directories and subdirectories. I need to extract all strings that start with "example.com". Part of a string can look like this:
["https://example.com/folder1",
href="https://example.com/anotherfolder2" target="
etc.
What I want to extract is:
folder1
anotherfolder2
etc.
from all files in all folders into one list, one entry per line.
I found some highly upvoted examples on StackOverflow, but they did not work. I tried this (adapted from those examples):
grep -Po '(?<=example.com=)[^,]*'
Thank you for help!
Upvotes: 2
Views: 1043
Reputation: 614
echo "https://example.com/folder1" | tr -s '/' | tr '/' '\n' > file
sed -i '1d' file
sed -n '1p' file # This will give you example.com
sed -n '2p' file # This will give you folder1
sed -i 1s'@example\.com@newsite.com@' file
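printf 'http://' > nf # no trailing newline, so the host is appended directly after the scheme
cat file >> nf # append every line (host and folder), not just line 2 onward
tr '\n' '/' < nf | sed 's@/$@@' > newfile # join lines with '/', then drop the trailing slash
cat newfile # This should be http://newsite.com/folder1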
rm -v ./nf
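The same splitting can be done without temporary files, using POSIX shell parameter expansion. This is only a sketch under the assumption that the URL has the form scheme://host/path; the variable names (`rest`, `host`, `path`, `new_url`) are made up for illustration.

```shell
url="https://example.com/folder1"
rest="${url#*://}"   # drop the scheme          -> example.com/folder1
host="${rest%%/*}"   # text before first slash  -> example.com
path="${rest#*/}"    # text after first slash   -> folder1
new_url="http://newsite.com/${path}"
printf '%s\n' "$host" "$path" "$new_url"
```

This prints example.com, folder1 and http://newsite.com/folder1 on separate lines.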
Upvotes: 0
Reputation: 3845
grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2| sed -e 's/https:\/\/example.com\///g'
grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2
extracts the content of the quoted string.
sed -e 's/https:\/\/example.com\///g'
strips the leading https://example.com/ and leaves the suffix.
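For comparison, the pipeline can probably be collapsed into a single grep call. The attempt in the question fails because the lookbehind `(?<=example.com=)` requires a literal `=` right after the domain, which never occurs in the sample lines. The sketch below assumes GNU grep built with PCRE support (`-P`), that only .html files matter, and that the wanted text ends at the next double quote or whitespace; the function name `extract_paths` is hypothetical.

```shell
# Print whatever follows http(s)://example.com/ in every .html file under $1,
# one match per line. \K discards the already-matched prefix from the output.
extract_paths() {
  grep -rhoP --include='*.html' 'https?://example\.com/\K[^"[:space:]]+' "$1"
}

# Usage (hypothetical paths): extract_paths your-directory > list.txt
```

Here `-r` recurses into subdirectories, `-h` suppresses the file name prefix, and `-o` prints only the matched part. Pipe the result through `sort -u` if duplicates should be removed.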
Upvotes: 1