Stigma

Reputation: 1736

Wiki-fying a text using LPeg

Long story coming up, but I'll try to keep it brief. I have many plain-text paragraphs which I extract from a system and re-output in wiki format, so that copying the data is less of an arduous task. This all works really well, except that no references are generated automatically for the 'topics' we have pages for; those currently have to be added by reading through all the text and manually changing each Topic to [[Topic]].

First requirement: each topic is only to be made clickable once, at its first occurrence. Otherwise it would become a really spammy linkfest, which would detract from readability.

Second requirement: to avoid issues with topics that start with the same words, overlapping topic names should be handled in such a way that the most 'precise' topic gets the link, and in later occurrences the less precise topics do not get linked, since they're likely not correct.

Example:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."

Now, I quickly figured out that no string.gsub(), simple or complicated, could get me what I need to satisfy the second requirement, as it provides no way to say 'consider this match as if it did not happen - I want you to backtrack further'. I need the engine to do something akin to:

input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi

Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has already been replaced by a wikified version, don't replace it, but simply continue matching from the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)
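
To make the intended behaviour concrete, here is a rough plain-Lua sketch of that scan (the helper name and the word-boundary checks are only illustrative; this is essentially the nested-loop approach I would rather avoid):

-- Rough sketch of the scan described above: at every position, try the topics
-- from longest to shortest; link a topic at its first occurrence only.
local function wikify_plain(input, topics)
    table.sort(topics, function(a, b) return #a > #b end) -- longest topics first
    local seen, out, pos = {}, {}, 1
    while pos <= #input do
        local matched
        for _, topic in ipairs(topics) do
            local prev = input:sub(pos - 1, pos - 1)          -- character before this position
            local nxt = input:sub(pos + #topic, pos + #topic) -- character after the candidate
            if not prev:match("%w") and input:sub(pos, pos + #topic - 1) == topic and not nxt:match("%w") then
                matched = topic
                break
            end
        end
        if matched then
            if seen[matched] then
                out[#out + 1] = matched                  -- already linked once: leave as plain text
            else
                seen[matched] = true
                out[#out + 1] = "[[" .. matched .. "]]"  -- first occurrence: link it
            end
            pos = pos + #matched                         -- resume scanning after the whole topic
        else
            out[#out + 1] = input:sub(pos, pos)          -- no topic here: copy one character
            pos = pos + 1
        end
    end
    return table.concat(out)
end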

Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.

(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and it is a great excuse to learn PEGs... except that I am in over my head a bit. Unless it simply is not possible with LPeg, writing the loop myself is not an option.)

Upvotes: 2

Views: 420

Answers (2)

daurnimator

Reputation: 4311

So... I got bored and needed something to do:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}

pcall ( require , 'luarocks.require' ) -- load the LuaRocks loader if it is available
require 'lpeg'
local locale = lpeg.locale ( )
local endofstring = -lpeg.P(1) -- succeeds only at the very end of the subject
local endoftoken = (locale.space+locale.punct)^1 -- a run of spaces and/or punctuation

table.sort ( topics , function ( a , b ) return #a > #b end ) -- Sort topics by length (longest first)
local topicpattern = lpeg.P ( false )
for i = 1, #topics do
    topicpattern = topicpattern + topics [ i ]
end

function wikify ( input )
    local topicsleft = { }
    for i = 1 , #topics do
        topicsleft [ topics [ i ] ] = true
    end

    local makelink = function ( topic )
        if topicsleft [ topic ] then
            topicsleft [ topic ] = nil
            return "[[" .. topic .. "]]"
        else
            return topic
        end
    end

    local patt = lpeg.Ct ( 
        (
            lpeg.Cs ( ( topicpattern / makelink ) ) * #(-locale.alnum+endofstring) -- Match topics followed by something that's not alphanumeric
            + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) -- Skip tokens that aren't topics
        )^0 * endofstring -- Match ad infinitum until the end of the string
    )
    return table.concat ( patt:match ( input ) )
end

print(wikify("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.")..'"')
print(wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")..'"')

I start off by making a pattern which matches all the different topics; we want to match the longest topics first, so the table is sorted by length from longest to shortest. Then we keep a list of the topics we haven't seen yet in the current input: makelink quotes/links the topic if we haven't seen it already, and otherwise leaves it alone.

Now for the actual lpeg stuff:

  • lpeg.Ct packs all our captures into a table (to be concatenated together for output)
  • topicpattern / makelink captures a topic and passes it through our makelink function (see the small standalone example after this list).
  • lpeg.Cs substitutes the result of makelink back in where the match of the topic was.
  • + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) if we didn't match a topic, skip a token (that is, everything up to and including the next run of spaces or punctuation)
  • ^0 repeat.
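
As a tiny standalone illustration of how a function capture combines with lpeg.Cs (this snippet is separate from the grammar above, and the upcase helper is just for demonstration):

local lpeg = require 'lpeg'
local upcase = function ( w ) return w:upper ( ) end
-- Cs keeps the subject as-is, except that each match of "cat" is replaced by
-- whatever the function capture returns.
local p = lpeg.Cs ( ( ( lpeg.P ( "cat" ) / upcase ) + 1 )^0 )
print ( p:match ( "a cat and a dog" ) ) --> a CAT and a dog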

Hope that's what you wanted :)

Daurn


Upvotes: 1

phil pirozhkov

Reputation: 4900

So why don't you use string.find? It searches only for the first occurrence of a topic and gives you its start and end indices. All you have to do is wrap that match in '[[' and ']]'. For each chunk, copy the topics table, and once a topic's first occurrence has been found, remove it from the copy. Sort the topics by length, longest first, so that the most precise topic is found first.
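
Not code from the answer itself, but a rough sketch of what that string.find approach could look like (the blanking of matched spans in a search copy is an added detail, so that shorter topics such as "Project" cannot re-match inside an already handled "Project Omega"; names are illustrative):

local function wikify_find(input, topics)
    local sorted = {}
    for i, t in ipairs(topics) do sorted[i] = t end
    table.sort(sorted, function(a, b) return #a > #b end)  -- longest topics first
    local mask = input   -- search copy; handled spans get blanked out
    local hits = {}
    for _, topic in ipairs(sorted) do
        local first, from = nil, 1
        while true do
            local s, e = mask:find(topic, from, true)        -- plain find, no patterns
            if not s then break end
            first = first or {s = s, e = e, topic = topic}   -- remember the first occurrence
            -- blank every occurrence so shorter overlapping topics cannot match here
            mask = mask:sub(1, s - 1) .. string.rep("\0", e - s + 1) .. mask:sub(e + 1)
            from = e + 1
        end
        if first then hits[#hits + 1] = first end
    end
    table.sort(hits, function(a, b) return a.s > b.s end)    -- splice from right to left
    for _, h in ipairs(hits) do
        input = input:sub(1, h.s - 1) .. "[[" .. h.topic .. "]]" .. input:sub(h.e + 1)
    end
    return input
end

On the question's example input this reproduces the expected output, although it does no word-boundary checking, so a topic embedded inside a longer word would still match.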

LPeg is a good tool, but it's not necessary to use it here.

Upvotes: 1
