Reputation: 119

Extract title from HTML and rename file to title

I have multiple files named output.html. I want to extract the title of each one, which I can do successfully using the following command:

cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'

Example:

7N8UGL0:~/Downloads$ cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
SEIKO 5 Finder - SNK559 Automatic Watch

Now I want to rename the output.html to the extracted title:

SEIKO 5 Finder - SNK559 Automatic Watch.html

I already managed to put this into a script:

#!/bin/bash
title=$(cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q')
echo "$title"

Furthermore, I have a lot of these output.html files, each in a directory named in epoch time format:

ls -l
drwxrwxrwx 1 userna userna 512 Aug  7 19:33 1500122724.81
drwxrwxrwx 1 userna userna 512 Aug  7 19:33 1500122724.82
drwxrwxrwx 1 userna userna 512 Aug  7 19:33 1500122724.83
drwxrwxrwx 1 userna userna 512 Aug  7 19:32 1500122724.84
drwxrwxrwx 1 userna userna 512 Aug  7 18:36 1500122724.85
drwxrwxrwx 1 userna userna 512 Aug  7 18:35 1500122724.86

I would like to extract the HTML title of every output.html in all of these directories and rename each output.html accordingly.

Many thanks in advance,

jmt

Upvotes: 0

Answers (2)

jmt

Reputation: 119

I was able to solve this by writing the following script:

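#!/bin/bash
# Rename every output.html below the current directory to "<title>.html".
find . -type f -name output.html -print0 |
while IFS= read -r -d '' file
do
    # Text of the first <title> element (GNU sed: i flag and T command).
    newfilename=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' "$file")
    # As before, the renamed file ends up in the current directory.
    mv "$file" "$newfilename.html"
done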

It works as follows:

1. For each file named output.html found under the current directory (.),
2. set a variable "newfilename" to the extracted title (e.g. "SEIKO 5 Finder - SNK559 Automatic Watch"),
3. rename $file from step 1 to the value of newfilename. I put "$newfilename" in quotation marks because of the spaces in the file name.

Next, I want to find a way to handle special characters such as / and :, because I get an error when the HTML title contains either of them.
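One possible way to deal with those two characters is to replace them before calling mv. The following is only a sketch: the helper name strip_path_chars and the choice of "-" as the replacement are assumptions, not part of the script above.

```shell
#!/bin/bash
# strip_path_chars (hypothetical helper): make a title usable as a single file name.
strip_path_chars() {
    local title=$1
    title=${title//\//-}   # "/" would otherwise be read as a directory separator
    title=${title//:/-}    # ":" is not allowed in file names on some file systems (e.g. NTFS)
    printf '%s\n' "$title"
}
```

It could then be used as newfilename=$(strip_path_chars "$newfilename") just before the mv.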

Upvotes: 0

Nic3500

Reputation: 8621

Use the find command to:

process only regular files (-type f)
that are named output.html (-name output.html),
and run your rename script on each of them (-exec rename.bash {} \;).

find descends recursively into every subdirectory.

So the complete command would look like:

find <YOUR TOP DIRECTORY> -type f -name output.html -exec rename.bash {} \; -print

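The -print at the end prints to stdout the path of every file for which the script exited successfully. The script must be executable and found through your PATH; otherwise give its location explicitly, for example ./rename.bash. Your rename script receives the path of the output.html that was found, including its directory, as its first argument ($1). So the script has to run your sed command and then mv the file from that argument to path/THE-TITLE-VALUE-YOU-JUST-EXTRACTED-WITH-SED.html.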
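A minimal sketch of what such a rename.bash could look like, assuming GNU sed, a title that sits on a single line, and that the renamed file should stay in its own directory. The function name rename_by_title is made up for illustration.

```shell
#!/bin/bash
# rename.bash (hypothetical): rename one HTML file to "<title>.html" in its own directory.

rename_by_title() {
    local file=$1 dir title
    dir=$(dirname -- "$file")
    # Text of the first <title> element (GNU sed: i flag and T command).
    title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' "$file")
    # Leave the file untouched when no title was found.
    [ -n "$title" ] || return 1
    mv -- "$file" "$dir/$title.html"
}

# Run only when a path is passed, as find -exec does.
if [ "$#" -gt 0 ]; then
    rename_by_title "$1"
fi
```

Because the function returns a non-zero status when no title is found, find would not print that file. A title containing "/" would still make the mv fail, so the title needs cleaning first.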

FYI, I would suggest being careful with this renaming. Spaces in filenames, although perfectly "legal", can cause issues later. Also make sure your titles do not include characters that are special to the shell, such as *, !, ( and ), among many others. Alphanumeric characters are fine, along with - and _.
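If you want to enforce that restriction rather than check titles by hand, one option is to squash everything else into underscores. This is a sketch only: safe_name is a hypothetical helper, and the underscore as replacement character is an arbitrary choice.

```shell
#!/bin/bash
# safe_name (hypothetical helper): reduce a title to letters, digits, - and _.
safe_name() {
    # Replace every run of other characters with a single underscore.
    # Runs of underscores already in the title are squeezed to one as well.
    printf '%s' "$1" | LC_ALL=C tr -cs 'A-Za-z0-9_-' '_'
}

safe_name 'SEIKO 5 Finder - SNK559 Automatic Watch'
# prints SEIKO_5_Finder_-_SNK559_Automatic_Watch
```

Note that two different titles can collapse to the same name this way, so an existing file might be overwritten unless mv is given -n or the target is checked first.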

Upvotes: 1
