Reputation: 11
I am writing a shell script to download a certain experimental branch of Blender from their website. When I curl the site, all versions come back as one really (and I mean really) long string with all the HTML run together. I can grep (ripgrep, specifically) only the Linux versions, but when I try to grep or even sed again, all the filenames start with "https://" and end with ".tar.xz".
They are also all on the same line, so a pattern that matches from the beginning of the first URL runs through to the end of the very last one.
os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">asset-browser-poselib</span><small>May 22, 05:26:55 - asset-browser-poselib - fba8de2e8688 - tar.xz - 149.56MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">asset-browser-poselib</span><small>May 22, 05:26:55 - asset-browser-poselib - fba8de2e8688 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">cycles-x</span><small>May 22, 05:03:02 - cycles-x - a117a9c63c3a - tar.xz - 143.11MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var 
next">cycles-x</span><small>May 22, 05:03:02 - cycles-x - a117a9c63c3a - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 12:38:57 - override-recursive-resync - 0d2c5bf06726 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">debug</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 12:38:56 - override-recursive-resync - 0d2c5bf06726 - tar.xz - 157.56MB</small></span><span class="build">x64</span><span class="size">debug</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 11:50:22 - override-recursive-resync - 0d2c5bf06726 - tar.xz - 149.73MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a 
href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 11:50:22 - override-recursive-resync - 0d2c5bf06726 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">profiler-editor</span><small>May 20, 04:54:26 - profiler-editor - ab200c6eddc6 - tar.xz - 149.54MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz
I tried using ripgrep (or grep):
rg -o 'https.*tar\.xz'
But that matches from the first filename all the way to the last. Maybe using AND logic in grep could help?
The URL from the string that I want is the following:
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
How can I extract that specific URL when they all start and end the same way?
Upvotes: 0
Views: 137
Reputation: 52211
Here's a way using the CLI HTML parser pup:
curl -s https://builder.blender.org/download/experimental/ \
| pup 'li.linux > a[href*="cycles-x"] attr{href}' \
| grep '\.tar\.xz$'
printing
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
The selector li.linux > a[href*="cycles-x"] selects <a> elements that contain cycles-x in their href attribute, for all links that are children of a list item with class linux. The display function attr{href} prints the value of the href attribute.
On its own, the pup selector returns two lines: the URL we want, and the URL for the checksum. CSS supports multiple attribute selectors as in a[href*="cycles-x"][href$=".tar.xz"], but pup doesn't, hence the grep filter.
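To illustrate why the final grep is needed, here is a minimal sketch that runs without pup or network access. The variable pup_output is a hypothetical stand-in for the two lines the selector prints, built from the URL in the question.

```shell
# Stand-in for the two lines printed by the pup selector (assumed, not fetched live).
base='https://builder.blender.org/download/experimental'
name='blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz'
pup_output=$(printf '%s\n' "$base/$name" "$base/$name.sha256")

# Anchoring the pattern at end of line drops the .sha256 checksum link.
url=$(printf '%s\n' "$pup_output" | grep '\.tar\.xz$')
printf '%s\n' "$url"
```

The resulting value in url could then be handed to curl or wget for the actual download.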
Upvotes: 1
Reputation: 133590
With GNU grep using non-greedy matching, we could try the following.
grep -oP 'https?:\/\/.*?tar\.xz' Input_file
Explanation: The -o option prints only the matched part, and the -P option enables PCRE regex in grep. The pattern matches from http or https up to tar.xz using a non-greedy match, so it prints every matching value in the file.
NOTE: If you are happy with the grep results above, which are printed to the terminal, and you want to save the output into Input_file itself, then append > temp && mv temp Input_file to the above command.
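As a rough sketch of the difference between greedy and non-greedy matching, the snippet below uses a shortened, hypothetical one-line sample (the link texts a, b, c are made up). The second pattern is a variant of the one above, not the same pattern: it assumes each URL is closed by a double quote and uses a PCRE lookahead to skip the .sha256 links.

```shell
# Hypothetical one-line sample: one other build plus the cycles-x archive and its checksum.
base='https://builder.blender.org/download/experimental'
html="<li><a href=\"$base/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz\">a</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz\">b</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256\">c</a></li>"

# Greedy: .* runs to the last tar.xz on the line, so everything is one long match.
greedy_count=$(printf '%s\n' "$html" | grep -oP 'https?://.*tar\.xz' | wc -l)

# Lazy match that cannot cross a quote; the lookahead (?=") requires the closing
# quote right after .tar.xz, so the .tar.xz.sha256 links do not match at all.
url=$(printf '%s\n' "$html" | grep -oP 'https?://[^"]*?\.tar\.xz(?=")' | grep -F 'cycles-x')
printf '%s\n' "$url"
```

Filtering with grep -F afterwards keeps the branch name out of the regex, so a name containing regex metacharacters would still be matched literally.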
Upvotes: 2
Reputation: 3063
You could put a new line after each instance of '.tar.xz' with:
sed -i 's/\.tar\.xz/.tar.xz\n/g' your_file
Then remove everything up to 'https' with:
sed -i 's/.*href="//' your_file
to change the file to this:
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz
Edit: @Wiktor Stribiżew has a better answer
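For reference, the same two sed steps can also be run as a pipeline instead of editing the file in place. This is only a sketch on a shortened, hypothetical sample; it assumes GNU sed (for \n in the replacement) and adds grep and sort -u to drop the leftover fragment and the duplicate lines that the .tar.xz.sha256 links produce.

```shell
# Hypothetical one-line sample: one other build plus the cycles-x archive and its checksum.
base='https://builder.blender.org/download/experimental'
html="<li><a href=\"$base/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz\">a</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz\">b</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256\">c</a></li>"

# 1. break the line after every .tar.xz (GNU sed understands \n here)
# 2. strip everything up to and including href="
# 3. keep only lines that are URLs, then remove duplicates
urls=$(printf '%s\n' "$html" \
  | sed 's/\.tar\.xz/&\n/g' \
  | sed 's/.*href="//' \
  | grep '^https' \
  | sort -u)
printf '%s\n' "$urls"
```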
Upvotes: 0
Reputation: 627022
You can use
grep -o 'https[^[:space:]"'"'"']*tar\.xz'
See the online demo.
Details
https - an https string
[^[:space:]"']* - zero or more chars other than whitespace, " and '
tar\.xz - a tar.xz string.
Upvotes: 0
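The bracket-expression pattern from the last answer can be tried on a shortened, hypothetical sample as sketched below. The extra grep -F and sort -u stages are additions, not part of that answer: the pattern also matches the .tar.xz prefix of each .sha256 link, so the same URL would otherwise be printed twice.

```shell
# Hypothetical one-line sample: one other build plus the cycles-x archive and its checksum.
base='https://builder.blender.org/download/experimental'
html="<li><a href=\"$base/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz\">a</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz\">b</a></li>"
html="$html<li><a href=\"$base/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256\">c</a></li>"

# The negated bracket expression stops each match at whitespace or a quote,
# so a match can never run from one href into the next.
url=$(printf '%s\n' "$html" \
  | grep -o 'https[^[:space:]"'"'"']*tar\.xz' \
  | grep -F 'cycles-x' \
  | sort -u)
printf '%s\n' "$url"
```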