frankm

Reputation: 35

Rewrite YAML frontmatter with regular expression

I want to convert my WordPress website to a static site on GitHub using Jekyll.

I used a plugin that exports my 62 posts to GitHub as Markdown. I now have these posts with extra frontmatter at the beginning of each file. It looks like this:

---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---

Jekyll doesn't parse this block correctly, and I don't need all of this frontmatter anyway. I would like each file's frontmatter converted to:

---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---

I would like to do this with regular expressions, but my regex knowledge is limited. Even with the help of this forum and a lot of Google searches I didn't get very far. I know how to match the complete frontmatter block, but how do I replace it with just the part specified above?

I might have to do this in steps, but I can't wrap my head around how to do this.

I use TextWrangler as the editor for the search and replace.

Upvotes: 2

Views: 1424

Answers (6)

Alejandro Alcalde

Reputation: 6220

You can also use python-frontmatter:

import glob
import io

import frontmatter

# Keys to keep in the front matter
KEEP = ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']

# Where are the files to modify
path = "*.markdown"

# Loop through all files
for fname in glob.glob(path):
    with io.open(fname, 'r', encoding='utf8') as f:
        # Parse file's front matter
        post = frontmatter.load(f)

    # Iterate over a copy of the keys, since entries are deleted while looping
    for k in list(post.metadata):
        if k not in KEEP:
            del post.metadata[k]

    # Save the modified file (dumps() returns the whole post as a string)
    with io.open(fname, 'w', encoding='utf8') as newfile:
        newfile.write(frontmatter.dumps(post))

If you want to see more examples visit this page

Hope it helps.

Upvotes: 1

Anthon

Reputation: 76742

YAML (like other relatively free-form formats such as HTML, JSON, and XML) is best not transformed with regular expressions: it is easy to get a pattern working for one example and have it break on the next one, which has extra whitespace, different indentation, etc.
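To make that fragility concrete, here is a small sketch. The pattern, the function name `naive_filter`, and both sample strings are illustrative assumptions, not taken from any of the answers. A line-based regex that works when every value fits on one line silently truncates a kept key whose value is written as a folded block scalar:

```python
import re

# Hypothetical line-based filter: keep only lines that start with a wanted key.
KEEP = re.compile(r"^(ID|post_title|author|post_date|layout|published):")


def naive_filter(front_matter: str) -> str:
    """Keep the lines that begin with one of the wanted keys, drop the rest."""
    return "\n".join(
        line for line in front_matter.splitlines() if KEEP.match(line)
    )


# Same data twice: once with a single-line title, once with a folded scalar.
simple = "ID: 51\npost_title: Here's my post title\npermalink: x"
folded = "ID: 51\npost_title: >\n  Here's my\n  post title\npermalink: x"

print(naive_filter(simple))
print(naive_filter(folded))  # the indented continuation lines of post_title are lost
```

Both inputs describe the same title in YAML terms, but only the first survives the filter intact.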

Using a YAML parser in this situation is not trivial either, as many either expect a single YAML document in the file (and barf on the Markdown part as extraneous stuff) or expect multiple YAML documents in the file (and barf because the Markdown is not YAML). Moreover, most YAML parsers throw away useful things like comments and reorder mapping keys.

I have used a similar format (YAML header, followed by reStructuredText) for many years for my ToDo items, and use a small Python program to extract and update these files. Given input like this:

---
ID: 51     # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---
additional stuff that is not YAML
  and more
  and more

And this program ¹:

import sys
import ruamel.yaml

from pathlib import Path


def extract(file_name, position=0):
    doc_nr = 0
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    yaml_str = ""
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:  # don't count --- on first line as next document
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                yaml_str += line
    # round_trip_load()/round_trip_dump() are deprecated in newer ruamel.yaml
    # releases; a YAML() instance is round-trip by default
    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    return yaml.load(yaml_str)


def reinsert(ofp, file_name, data, position=0):
    doc_nr = 0
    inserted = False
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:
                    ofp.write(line)
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                if inserted:
                    continue
                # dump the modified front matter once, in place of the original lines
                ruamel.yaml.YAML().dump(data, ofp)
                inserted = True
                continue
            ofp.write(line)


data = extract('input.yaml')
for k in list(data.keys()):
    if k not in ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']:
        del data[k]

reinsert(sys.stdout, 'input.yaml', data)

You get this output:

---
ID: 51     # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
additional stuff that is not YAML
  and more
  and more

Please note that the comment on the ID line is properly preserved.


¹ This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author; it tries to preserve as much information as possible on round-trips.

Upvotes: 2

lijn

Reputation: 26

I'm editing my post because I misinterpreted the question the first time: I failed to understand that the actual post was in the same file, right after the closing ---.

Using egrep and GNU sed (installed as gsed, so not the sed that ships with macOS), it's relatively easy:

# create a working copy
mv file file.old
# take only the frontmatter (line 1 up to the closing ---), keep the fields you need, and redirect that to a new file
gsed -n '1,/^---/p' file.old | egrep '^(---|(ID|post_title|author|post_date|layout|published):)' > file
# append everything after the frontmatter, leaving any later --- in the post alone
gsed '1,/^---/d' file.old >> file
# remove working copy
rm file.old

And if you want it all in one go:

for i in *; do mv "$i" "$i.old"; gsed -n '1,/^---/p' "$i.old" | egrep '^(---|(ID|post_title|author|post_date|layout|published):)' > "$i"; gsed '1,/^---/d' "$i.old" >> "$i"; rm "$i.old"; done
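For anyone without GNU sed at hand, the same two steps (filter the frontmatter, copy the rest of the post verbatim) can be sketched in plain Python. This is only a sketch under assumptions: the function name `filter_front_matter` and the rule "an unindented `key:` starts a new field" are mine, not part of the commands above. It also carries the indented continuation lines along with whichever key they belong to, which a line-by-line grep cannot do.

```python
import re

KEEP = {"ID", "post_title", "author", "post_date", "layout", "published"}
# Assumption: a top-level key is an unindented word followed by a colon.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*):")


def filter_front_matter(text: str) -> str:
    """Drop unwanted top-level keys (and their indented continuation
    lines) from the leading frontmatter block; leave the body untouched."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return text  # no frontmatter: nothing to do
    out = [lines[0]]
    keeping = False
    for i, line in enumerate(lines[1:], start=1):
        if line.rstrip() == "---":       # closing delimiter
            out.append(line)
            out.extend(lines[i + 1:])    # the post body is copied verbatim
            return "".join(out)
        m = TOP_LEVEL_KEY.match(line)
        if m:                            # a new top-level key starts here
            keeping = m.group(1) in KEEP
        if keeping:
            out.append(line)
    return text  # no closing '---' found: leave the file alone
```

Reading each file, passing its contents through `filter_front_matter`, and writing the result back would replace the loop above; keep a backup copy until the output has been checked.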

For good measure, here's what I wrote as my first response:

===========================================================

I think you're making this way too complicated.

A simple egrep will do what you want:

egrep '(---|ID|post_title|author|post_date|layout|published)' file

redirect to a new file:

egrep '(---|ID|post_title|author|post_date|layout|published)' file > newfile

a whole dir at once:

for i in `ls`; do egrep '(---|ID|post_title|author|post_date|layout|published)' $i > $i.new; done

Upvotes: 1

Rogier Brussee

Reputation: 11

You basically want to edit the file. That is what sed (stream editor) is for.

sed -E '1,/^---/{/^(---|(ID|post_title|author|post_date|layout|published):)/!d;}' file

Upvotes: 1

terafl0ps

Reputation: 704

You could do it with gawk like this:

gawk 'BEGIN { RS="---"; FS="\000" } (FNR == 2) { print "---"; n = split($1, fm, "\n"); for (i = 1; i <= n; i++) { if (fm[i] ~ /^(ID|post_title|author|post_date|layout|published):/) { print fm[i] } } printf "---" } (FNR > 2) { printf "%s%s", $0, RT }' post1.html > post1_without_frontmatter_fields.html

Upvotes: 1

Josef Kufner

Reputation: 2989

In cases like yours it is better to use an actual YAML parser and a scripting language. Cut the metadata out of each file into standalone files (or strings), then use a YAML library to load the metadata. Once the metadata is loaded, you can modify it safely. Then use the serialize method from the very same library to create a new metadata file, and finally put the files back together.

Something like this:

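<?php
// Split on lines consisting only of dashes; the limit of 3 keeps any later "---" in the body intact.
// For a file that starts with "---", $before is the empty string.
list ($before, $metadata, $after) = preg_split("/^---+[ \t]*\n/m", file_get_contents($argv[1]), 3);
$yaml = yaml_parse($metadata);
$yaml_copy = [];
foreach ($yaml as $k => $v) {
    // copy the data you wish to preserve to $yaml_copy
    if (...) {
        $yaml_copy[$k] = $yaml[$k];
    }
}
// yaml_emit() adds its own "---" and "..." document markers; strip them before reassembling
$emitted = preg_replace('/\A---\n|\.\.\.\n\z/', '', yaml_emit($yaml_copy));
file_put_contents('new/'.$argv[1], $before."---\n".$emitted."---\n".$after);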

(It is just an untested draft with no error checks.)
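The cut, modify, and reassemble steps can also be sketched in Python using only the standard library. The names `rewrite_front_matter`, `transform`, and `keep_id_only` are hypothetical; `transform` stands in for "load with a YAML library, edit, serialize again", which you would supply with whatever parser you prefer. The stand-in below just filters lines so the sketch runs without extra packages.

```python
import re
from typing import Callable

# Leading "---" line, the metadata (non-greedy, up to the first closing
# "---" line), then everything else as the body.
FRONT_MATTER = re.compile(r"\A---\n(.*?\n)---\n(.*)\Z", re.DOTALL)


def rewrite_front_matter(text: str, transform: Callable[[str], str]) -> str:
    """Cut the metadata out, hand it to `transform`, and put the file back together."""
    m = FRONT_MATTER.match(text)
    if m is None:
        return text  # no frontmatter found: return the text unchanged
    metadata, body = m.groups()
    return "---\n" + transform(metadata) + "---\n" + body


def keep_id_only(metadata: str) -> str:
    # Placeholder for a real parse/modify/serialize step with a YAML library.
    return "".join(
        line for line in metadata.splitlines(keepends=True)
        if line.startswith("ID:")
    )


post = "---\nID: 51\nsw_timestamp:\n  - \"399956\"\n---\nBody\n---\nmore body\n"
print(rewrite_front_matter(post, keep_id_only))
```

Because the pattern is anchored at the start of the text and the metadata group is non-greedy, a later `---` inside the post body is left alone.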

Upvotes: 1
