Reputation: 301
I have a text file containing the HTML source of a web page. Some lines contain data-adid="...", and those are the lines I'd like to capture. To do that, I use:
Id=$(grep -m 10 -A 1 "data-adid" Textfile)
to get the first ten results. The variable Id contains the following:
<arcicle class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...
I would like to get the following output:
id="1234567890" id="2134567890" id="3124567890"
When using the grep command, I only manage to get the numbers, e.g.
Id2=$(echo $Id | grep -oP '(?<=data-adid=").*?(?=")')
gets 1234567890 2134567890 3124567890
When trying
Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')
this will only give me id= id= id=
How could the code be changed to get the desired output?
Upvotes: 1
Views: 117
Reputation: 204558
With any sed:
$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
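One caveat: as written, sed also prints lines that do not contain data-adid unchanged, which matters if the input still holds the context lines and -- separators produced by grep -A 1. A minimal sketch, assuming a made-up sample value in a hypothetical variable named input, that prints only the matching lines and joins them on a single line as in the desired output:

```shell
# Hypothetical sample mimicking the grep -A 1 output from the question.
input='<article class="aditem" data-adid="1234567890"
<div class="aditem-image">
--
<article class="aditem" data-adid="2134567890"
<div class="aditem-image">'

# -n suppresses the default output; the trailing p prints only lines
# on which the substitution actually succeeded.
ids=$(printf '%s\n' "$input" | sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p')

# paste -s merges all lines into one; -d ' ' uses a space as the separator.
joined=$(printf '%s\n' "$ids" | paste -s -d ' ' -)
printf '%s\n' "$joined"
```

Like the original command, this keeps only the last data-adid on any given line, because the leading .* is greedy.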
Upvotes: 0
Reputation: 133760
Although HTML should be handled with tools that understand HTML well, the OP mentions needing shell tools, so I would go with awk for this one. Written and tested at https://ideone.com/EpU1aW
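echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){        # first data-adid="..." on the line
val=substr($0,RSTART,RLENGTH)         # extract the matched text
sub(/^data-ad/,"",val)                # drop the data-ad prefix, leaving id="..."
print val
val=""
}
'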
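Note that match() reports only the first occurrence on each line. If several data-adid attributes can end up on the same line, a loop over the remainder of the line picks up all of them. This is a sketch under that assumption, using a made-up one-line sample in var and a hypothetical helper variable named line:

```shell
# Hypothetical input: two ids on a single line.
var='<article data-adid="111"> <article data-adid="222">'

out=$(printf '%s\n' "$var" |
awk '
{
  line = $0
  # Keep matching until no data-adid attribute is left in the remainder.
  while (match(line, /data-adid="[^"]*"/)) {
    val = substr(line, RSTART, RLENGTH)
    sub(/^data-ad/, "", val)               # leave only id="..."
    print val
    line = substr(line, RSTART + RLENGTH)  # continue after this match
  }
}')
printf '%s\n' "$out"
```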
Upvotes: 2
Reputation: 141890
data-ad matches only the literal text data-ad. Match the id= part too, from the opening double quote up to the next double quote. I also see no reason to use fancy lookarounds: just match the string and output only the part you want.
grep -oP 'data-ad\Kid="[^"]*"'
This should be enough. Note that $Id undergoes word splitting when it is expanded unquoted, so it should most probably be quoted. Also, HTML cannot be parsed reliably with regular expressions, so you should most probably use HTML-aware tools instead.
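To illustrate the quoting point, here is a small sketch with a made-up two-line value for Id. The variables unquoted_lines and quoted_lines are hypothetical names used only for this demonstration:

```shell
# Hypothetical two-line value standing in for the grep output.
Id='<article data-adid="1234567890"
<article data-adid="2134567890"'

# Unquoted: word splitting turns the newline into an argument boundary,
# so echo joins everything with spaces and emits a single line.
unquoted_lines=$(echo $Id | wc -l)

# Quoted: the newline is preserved, so two lines are emitted.
quoted_lines=$(printf '%s\n' "$Id" | wc -l)

echo "unquoted:" $unquoted_lines "quoted:" $quoted_lines
```

With the unquoted form, line-oriented tools such as sed or awk see one long line instead of one line per match, which changes what they print.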
Upvotes: 2