Reputation: 31
I would like to rename HTML files using the HTML title tags & the publication date.
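I found an answer for the title part here: Renaming HTML files using <title> tags
Edited Code:
#!/bin/bash
# Note: the for loop splits on whitespace, so this assumes file names without spaces.
for f in $(find . -type f -name '*.html')
do
# IGNORECASE needs GNU awk; RS="^$" makes gawk read the whole file as one record
title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2}' "$f" )
mv -i "$f" "${title//[ ]/-}".html
done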
I'm not sure how to get the date part working.
The result would be title_PublicationDate.html
Here is an example url: https://www.voltairenet.org/article178442.html
That would be renamed from "article178442.html" to "You Are The Hope by Paul Craig Roberts_20130508.html"
Here are the lines that have the publication date in them:
Line 19:
<meta property="og:article:published_time" content="2013-05-07T23:31:07Z" />
Line 207:
<span class="updated" title="2013-05-8"><time datetime="2013-05-08 02:31:07" pubdate>| 8 May 2013</time></span></span>
Edit: Would it be possible to do this with Python and Beautiful Soup, using the Open Graph tags?
Upvotes: 0
Views: 611
Reputation: 425
I had to do some guesswork because your question isn't very clear, but I arrived at this solution:
I have made a git repo called renamer. This quick little script can be run from inside the folder with all your index files. It fetches all the .html files in the folder and renames each one to the text inside the <title> tag of that file plus the date it was published, which I fetched from one of the meta tags in the html file. This script assumes the structure of the meta tag in your html element, so it should work as long as the meta tag structure stays the same.
NOTE: The files you download from GitHub have to be placed in a folder that sits inside the directory containing your html files (see the images below)
You probably already know this, but once you have placed the files in the folder, run npm install in the folder of the program. This will install all the required dependencies, which in this case are file-system, path and readline.
To run the program, type node main in the folder of the program, and just like that, all the html files in the parent folder will be renamed.
In theory, you should be able to keep the program in the folder where your index files are located for future use as well. Very handy if you ever quickly need to change some names in the future :)
Hope I answered your question.
Upvotes: 1
Reputation: 14493
You can extract the publication date from the meta tag using sed:
<meta property="og:article:published_time" content="2013-05-07T23:31:07Z" />
Use a sed script to extract pub_date in YYYYMMDD format ("$f" is the HTML file being processed, as in the loop from the question):
pub_date=$(sed -n -e 's/.*meta property.*published_time.*\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\).*/\1\2\3/p' "$f")
The code is 'opportunistic' and assumes a very specific structure of the meta tag. However, as a quick and dirty solution, it will work. Use $pub_date to construct the file name.
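To show how the two pieces might fit together, here is a minimal sketch that builds the title_PublicationDate.html name from the title and the meta tag. The function names (get_title, get_pub_date, rename_html, rename_all) are made up for illustration. It assumes lowercase <title> and </title> tags on a single line, the property attribute appearing before content in the meta tag, and all files sitting in one directory (no recursion). For the example page it would produce 20130507, because the meta tag holds the UTC timestamp; the 20130508 in the question matches the <time datetime="..."> attribute instead, which would need a different pattern.

```shell
#!/bin/sh
# Sketch: rename *.html files in one directory to "<title>_<YYYYMMDD>.html".

# Print the text between <title> and </title> (first match only).
# Assumes lowercase tags that open and close on the same line.
get_title() {
    sed -n 's|.*<title>\(.*\)</title>.*|\1|p' "$1" | head -n 1
}

# Print the og:article:published_time date as YYYYMMDD (first match only).
# Assumes the property attribute comes before the content attribute.
get_pub_date() {
    sed -n 's/.*published_time.*content="\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1\2\3/p' "$1" | head -n 1
}

# Rename a single file; skip it when the title or the date cannot be found.
rename_html() {
    src=$1
    title=$(get_title "$src")
    pub_date=$(get_pub_date "$src")
    if [ -z "$title" ] || [ -z "$pub_date" ]; then
        echo "skipping $src: title or date not found" >&2
        return 1
    fi
    # A slash in the title would be read as a directory separator.
    title=$(printf '%s' "$title" | tr '/' '-')
    target="$(dirname "$src")/${title}_${pub_date}.html"
    # Nothing to do when the file already has the wanted name.
    [ "$src" = "$target" ] && return 0
    # -i asks before overwriting an existing file.
    mv -i -- "$src" "$target"
}

# Rename every .html file directly inside the given directory (default: .).
rename_all() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue      # the glob matched nothing
        rename_html "$f" || continue # keep going after a skipped file
    done
}
```

After sourcing or pasting the functions, running rename_all /path/to/html-files (or plain rename_all for the current directory) would do the renaming. Spaces in the title are kept, as in the expected name from the question.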
Upvotes: 1