badaboum

Reputation: 31

Renaming HTML files using <title> tags and publication date

I would like to rename HTML files using the text of their <title> tags and their publication date.

I found an answer for the title using the title tags here:

Edited Code:

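#!/bin/bash
# Requires bash 4+ (globstar) and GNU awk (IGNORECASE and RS="^$" are gawk extensions).
shopt -s globstar nullglob   # ** also matches files in subdirectories
for f in ./**/*.html
   do
   # Read the whole file as one record and print the text between the title tags.
   title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS="^$"} {print $2; exit}' "$f" )
   # Replace spaces with hyphens; the renamed file lands in the current directory.
   mv -i -- "$f" "${title//[ ]/-}".html
done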

(The code above comes from the question "Renaming HTML files using <title> tags".)

I am not sure how to get the date part working.

The result would be title_PublicationDate.html

Here is an example URL: https://www.voltairenet.org/article178442.html

That would be renamed from "article178442.html" to "You Are The Hope by Paul Craig Roberts_20130508.html"

Here are the lines that have the publication date in them:

Line 19:

<meta property="og:article:published_time" content="2013-05-07T23:31:07Z" />

Line 207:

<span class="updated" title="2013-05-8"><time datetime="2013-05-08 02:31:07" pubdate>| 8 May 2013</time></span></span>

Edit: would it be possible to do this with Python and Beautiful Soup using the Open Graph tags?

Upvotes: 0

Views: 611

Answers (2)

Frikk Ormestad Larsen

Reputation: 425

Your question isn't very clear, so I had to do some guesswork, but I arrived at this solution:

I have made a git repo called renamer. This quick little script can be run from inside the folder with all your index files. It finds all the .html files in the folder and renames each one to the text inside its <title> tag plus the date it was published, which I read from one of the meta tags in the HTML file. The script assumes the structure of the meta tag in your HTML files, so it should work as long as that structure stays the same.

NOTE: The files you download from GitHub have to be placed inside a folder which is inside the directory of your HTML files (see the images below).

You probably already know this, but once you have placed the files in the folder, run npm install in the folder of the program. This will install all the required dependencies, which in this case are file-system, path and readline.

To run the program, type node main in the folder of the program, and all the HTML files in the parent folder will be renamed.

[Image: the files before the program has been run]

[Image: the files after the program has been run]

[Image: the command to run inside the folder containing my program]

[Image: the folder in which the downloaded files should be located]

In theory, you should be able to keep the program in the folder where your index files are located for future use as well. Very handy if you ever quickly need to change some names in the future :)

Hope I answered your question.

Upvotes: 1

dash-o

Reputation: 14493

You can extract the publication date from the meta tag using sed:

<meta property="og:article:published_time" content="2013-05-07T23:31:07Z" />

Using a sed script to extract pub_date in YYYYMMDD format, where "$f" is the HTML file being processed:

pub_date=$(sed -n -e 's/.*meta property.*published_time.*\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\).*/\1\2\3/p' "$f")

The code is 'opportunistic' and assumes a very specific structure of the meta tag. However, as a quick and dirty solution, it will work. Use $pub_date to construct the file name.
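
For illustration, here is a minimal sketch of how the title and $pub_date could be combined into the title_PublicationDate.html name from the question. The function names get_title, get_pub_date and rename_html are made up for this example, and it assumes that the <title> element sits on a single line in lowercase and that the file contains the og:article:published_time meta tag shown above. Note that this tag holds a UTC timestamp, so the sample page would yield 20130507 rather than the 20130508 found in its <time datetime="..."> element.

```shell
#!/bin/bash
# Sketch only: rename each HTML file given as an argument to "<title>_<YYYYMMDD>.html".

# Print the text between <title> and </title> (first match only, single-line titles).
get_title() {
    sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$1" | head -n 1
}

# Print the publication date as YYYYMMDD, read from the Open Graph meta tag.
get_pub_date() {
    sed -n 's/.*meta property.*published_time.*\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\).*/\1\2\3/p' "$1" | head -n 1
}

# Rename one file in place; skip it if the title or the date cannot be found.
rename_html() {
    local f=$1 title pub_date
    title=$(get_title "$f")
    pub_date=$(get_pub_date "$f")
    if [ -z "$title" ] || [ -z "$pub_date" ]; then
        echo "skipping $f: title or date not found" >&2
        return 1
    fi
    title=${title//\//-}   # a slash in the title would be read as a directory separator
    mv -i -- "$f" "$(dirname -- "$f")/${title}_${pub_date}.html"
}

# Process every file passed on the command line.
for f in "$@"; do
    rename_html "$f"
done
```

Saved as, say, rename.sh (a hypothetical name), it could be run as ./rename.sh article178442.html. Files without a usable title or date are left untouched, and mv -i asks before overwriting an existing file.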

Upvotes: 1
