Mike

Reputation: 2391

regex to match more than one space in FrontPage 2003

I use FrontPage 2003, and I want a regular expression that finds runs of more than one space. It should ignore a single space and match only two or more consecutive spaces (in the text, not in the HTML code).

Upvotes: 0

Views: 601

Answers (4)

Just to update... I was working with FrontPage's weird RegEx recently and remembered some question on stackoverflow, so I looked it up. OK, FP's RegEx is really buggy, so something you'd search with about 8 chars in PCRE, you'd better spell out the long way in FP to avoid problems.

To find 2 or more adjacent spaces in the source code section of FrontPage, you need to look for &nbsp; in repetition, or with a space before or after it. To create the right search/replace to get most of it, we need to remember:

First, FP's editor converts any series of more than one space to just one space (usually) preceded by a repeated number of &nbsp; entities, so that the total number of spaces is the same, but what appears in source code looks like this:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (note the actual white-space character is at the end)

That's what FP editor would do with 7 spacebar taps.
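As a rough illustration of the rule just described, here is a minimal Python sketch. The function name `frontpage_spaces` is made up for this example, and it only models the "usual" behavior stated in this answer (all but the last space become &nbsp;), not everything FrontPage actually does.

```python
NBSP = "&nbsp;"


def frontpage_spaces(count: int) -> str:
    """Return the source-code form described above for `count` typed spaces.

    Hypothetical helper: models only the usual case from this answer.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return ""
    # All but the last space become &nbsp; entities; the last stays a real space.
    return NBSP * (count - 1) + " "


# Seven spacebar taps: six entities followed by one real space.
print(repr(frontpage_spaces(7)))
```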

Second, note that if you add a bunch of spaces (or even just ONE) adjacent to a group like the one above that the FP editor has created, it could add it as a normal white space char, or it could add it as an &nbsp; -- depending on what it's adjacent to.

Thus you can easily end up with repeating and alternating whitespaces and &nbsp; entities as you add multiple spaces over time using the WYSIWYG editor. FP converts the new ones you add into a mix of &nbsp; and whitespace chars, just appending them to whatever other spaces are there, and converts a whitespace to an &nbsp; ONLY IF the newly added spaces cause two whitespaces to be next to each other. FrontPage never really goes over a whole file to find strings of spaces composed of both whitespaces and &nbsp; entities in alternation, so you can have a jumble of both in one big, long string. They will alternate, and in that alternation there can be repeated &nbsp; entities, but likely not repeated ASCII white spaces.

So, to construct a FrontPage RegEx (a link to a good page explaining the differences in FPRE (LOL) is in my prior post here) -- you need to find any 2 adjacent spaces in any of the four forms:

&nbsp; (most common, that's &nbsp; followed by a space)

...or the reverse of that:  &nbsp; (a space, then an &nbsp;)

...or 2 or more repeating &nbsp; codes: &nbsp;&nbsp; with no white spaces

...or, RARELY, two normal spaces. These are rare, since it means the FP editor somehow didn't remove them. NOTE: usually that's because they're not in visible HTML text but inside HTML tags or scripts or something, so replacing them with just one probably won't mess up such elements, but be aware.
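For readers who want to check a page outside FrontPage first, the four arrangements (&nbsp; then a space, a space then &nbsp;, two &nbsp; in a row, two plain spaces) can be tested with an ordinary regex. This is a sketch in Python's `re` syntax, not FrontPage's; the name `has_double_space` is invented for the example.

```python
import re

# One alternative per arrangement; [ ] is a literal space written visibly.
TWO_ADJACENT = re.compile(r"&nbsp;[ ]|[ ]&nbsp;|&nbsp;&nbsp;|[ ][ ]")


def has_double_space(source: str) -> bool:
    """True if the source contains any of the four two-space arrangements."""
    return TWO_ADJACENT.search(source) is not None


print(has_double_space("word&nbsp; word"))  # &nbsp; followed by a space
print(has_double_space("word word"))        # a single space only
```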

OK, LET'S DO IT...

We need to construct a RegEx (using FrontPage's weird and buggy RegEx) FIND/REPLACE.

Our FIND must REQUIRE the existence of at least one of those above four possible arrangements of two spaces adjacent: either two of the same type adjacent, or one of each adjacent. Else the RegEx pattern must -not- be matched by the text in order to avoid replacing SINGLE spaces which is probably harmless, but why do unnecessary stuff?

For this pattern of 2 adjacent spaces which can occur in 4 possible arrangements (above) we'll use a set of round braces (we don't need to capture here - captures in weird FP RegEx are done with curly braces, btw).

Inside those round braces we will put all 4 possible patterns that match, and separate them with a pipe | to indicate an "OR". Then after the close-brace, we'll put a + quantifier to say we need to find AT LEAST one of these 4 combos to have a pattern-match. (Again, otherwise we're not dealing with 2 or more spaces, and we skip it.)

Then, since ANY type of space could come before or after our matching pattern -- assuming a big messy long string of alternating types of spaces is present, and those are common in FrontPage -- we will add some optional alternating space types, using both the normal space and &nbsp;, to our search string. After each we will use a * quantifier, saying there can be 0 or more of these things, and if there are, they MATCH. We will put a series of these before our all-important set in round braces, and another series after them. Why? To grab absolutely as many adjacent spaces as we can and do as FEW find/replace operations as possible, without having to repeat them to get out all the really long messy stuff.

So if we make the search like this, no matter HOW MESSY your page has gotten over time, it's UN-likely you will need to run this more than once on a page, or on a whole site, unless you have a really huge space-mess, in which case, just RUN IT AGAIN. Guaranteed the 2nd time will get it all. (I tried it on a really messy FP page...it gets 'em all.)

Here it is. Yes, we could shorten it a lot in PCRE, and maybe even in FrontPage, but DO NOT TRY, since FP RegEx is buggy and it will miss things or over-select things or worse if you make it think too much, it will just lock up or crash FP.

Find and Replace FIND: YES-Find in Source Code YES-Use Regular Expressions NO-match case NO-find whole word only

*(\&nbsp\;)* *(\&nbsp\;)* *( \&nbsp\;|\&nbsp\; | \&nbsp\;\&nbsp\;)+ *(\&nbsp\;)* *(\&nbsp\;)* *

(btw, the above string starts with a plain white space -- ASCII Hex 20. It ends with the asterisk.)

REPLACE with: &nbsp; (has no leading or trailing spaces)

Run it and you're done. Try it on one messy page first to be sure.
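If you would rather do the same clean-up outside FrontPage, the idea can be sketched in Python. This is not FrontPage's RegEx syntax. The function name `collapse_space_runs` and the default replacement of a single &nbsp; are assumptions made for this example, so pass a plain space instead if that is what you want. Like the FIND above, it works on raw source and does not skip text inside tags.

```python
import re

# A run of two or more "spaces", each one a real space or an &nbsp; entity.
MIXED_RUN = re.compile(r"(?:[ ]|&nbsp;){2,}")


def collapse_space_runs(source: str, replacement: str = "&nbsp;") -> str:
    """Replace every mixed run of 2+ spaces/entities with `replacement`."""
    return MIXED_RUN.sub(replacement, source)


messy = "one&nbsp;&nbsp; &nbsp; two three&nbsp;four"
# Only the long run after "one" changes; single spaces are left alone.
print(collapse_space_runs(messy))
```

Because the quantifier grabs the whole run in one match, a single pass is enough here, whatever the mix of spaces and entities.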

This was created on FrontPage 2003 which is a bit better with FP RegEx than older versions, but FP 2002 is about the same.

Yeah, it's big and ugly, but again, don't over-think FP RegEx, and don't make IT do any thinking or it will just crash on you or screw up the search/replace. Just use that big long ugly thing and be done.

Good luck. This WILL work.

Upvotes: 1

FrontPage doesn't allow two consecutive spaces in its code editor -- the FrontPage editor automatically changes the second and any further consecutive space(s) to &nbsp; (the HTML non-breaking space entity) in the html code.

And it does that without asking you, during editing, or even if it's just doing its "housekeeping" of a site re-calc, or other tasks and finds a double-space you may have added with a different editor.

BUT, it will allow you to intersperse spaces with &nbsp; entities if you edit it that way in the code box, or at least it did up through the latest release of FP2003, so it's probably still that way.

ALSO, it will allow a space at the end of a line in the editor, then on the next line in the editor, it will allow another space, and it usually will not convert either to &nbsp; -- but that sometimes varies based on the editor's perception of the need to convert those. For example, it's more intrusive at converting multiple spaces, even when line-separated in the code, if it's within span tags or div tags or sometimes if in a table cell (especially if nested).

FrontPage was built to be intrusive and keep you from doing things it thinks you shouldn't be doing according to the loose html standards of its day. (Yes, it barely met those meager standards, and was/is messy with tags and styles, but it did try, and what it produces is usually fully viewable on most browsers, even today.)

So you'll want to find: (space)&nbsp; or the reverse order of that, and replace them all with &nbsp; to be safe.

You can do this in FrontPage's own Search/Replace and check _IGNORE WHITE SPACE so it spans across breaks in lines of HTML code and spans over tabs in the code that are for coding ease only.

If you're clever with your RegEx, you can make a capture group that finds either of these ways to express a space where you set the minimum to 2, and maximum as high as you want... say 200.

That looks something like this (typing this in a hurry, so not probably exactly right, but you get the idea) if you did it in "normal" PCRE RegEx:

((?: |&nbsp;){2,200})
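One thing to watch in PCRE-style engines: square brackets define a set of single characters, so putting &nbsp; inside them matches the letters n, b, s, p on their own. A grouped alternation treats the entity as one token. The small Python check below shows the difference; the variable names are just for this example.

```python
import re

# Character class: any 2-200 characters drawn from space & n b s p ;
CHAR_CLASS = re.compile(r"[ &nbsp;]{2,200}")
# Alternation: 2-200 tokens, each a real space or the whole &nbsp; entity.
ALTERNATION = re.compile(r"(?:[ ]|&nbsp;){2,200}")

# The class wrongly matches ordinary letters inside a word.
print(CHAR_CLASS.search("snaps") is not None)
# The alternation ignores the word and only matches real space runs.
print(ALTERNATION.search("snaps") is not None)
print(ALTERNATION.search("a&nbsp; b").group())
```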

But FrontPage has its own RegEx, which has changed with each new release of FrontPage, just to make it interesting. The brackets are almost all used differently than in PCRE RegEx, and I think the way you select a capture group requires different placement of the grouping symbols.

FrontPage's own set of special characters for RegEx are now sparsely documented as sites disappear, but here is one reference of many:
http://www.softpanorama.org/Office/Frontpage/regular_expressions.shtml

The Microsoft Office link to these special RegEx rules is dead, but I'm putting it below, and it's still linked-to by many pages on MSDN and elsewhere in MS's various help fora:

http://office.microsoft.com/assistance/preview.aspx?assetid=hp030923241033&ctt=4&origin=ch063729491033

Remember you can always use an editor that lets you use more standard RegEx to go through the files and find and replace all of that, without worrying about bowing to FrontPage's quirky RegEx rules. Such editors include:

NoteTab Pro, Notepad++, jEdit, UltraEdit, ...TONS MORE

Just remember that if you edit FrontPage pages from OUTSIDE the FrontPage Software, you need to go to TOOLS > RECALCULATE HYPERLINKS after you're done and re-enter the software. Doing that is not essential if you didn't change any links -- since basically doing that operation just updates the "shadow" .htm(l) file for each altered file -- the shadow file lives in the /_vti_cfg/ subdirectory of the directory where the file resides and it mostly just keeps tracks of links INSIDE the actual .htm(l) file. Then that information is gathered and site-maps/link-maps/navigation-maps are recorded in the website's root directory's /_vti_pvt/ directory where it builds a HUGE list of links (bi-directionally) in files with names like: linkinfo.btr and doctodep.btr and deptodoc.btr.

The above ^^^ is very important to do (recalc links) even if you think you did not disturb any hyperlinks when you played with your files outside of FrontPage and here's why:

Even if you're publishing via FTP-only, with -no- FP Server Extensions, you still need those files up to date. When you publish by FTP, technically, FrontPage sees this as using DTI (Design-Time-Includes) rather than Server Includes, so it pre-merges your headers/footers, etc, and adjusts many location-relative-relational hyperlinks and it does it all before putting up the page. So you still need these /_vti_whatever/ directories and their various files on your design-side (MS-Win PC) to handle whatever FrontPage features are still viable even without the (dangerous!) FPSEs being on your server.

(Tangential, but valuable: If you work with big sites (still) in FrontPage and upload by FTP, you need to do these recalcs fast and create the site reports fast, so find one of those pages that shows you how to MkLink or otherwise set your \Cache\IE to a location on a RAMdisk or at least an SSD drive because then \Cache\IE\FrontPageTempDir is on a much faster drive. When I dust off FP2003 and use it to update an old site with 60,000 files and 5.2 million links, I have it set to cache on the RAMdisk and it recalculates in just a couple minutes, versus a couple hours the old way. Page reports same speed or faster, even when the result is a list of 5 million links or 60k files.)

Either way, always: Tools > Recalculate Hyperlinks after you've changed any files outside of the FrontPage client software.

One last thought on deleting repeat spaces -- in the PUBLISH SETTINGS there was a "remove duplicate spaces" checkbox somewhere near where you can select to "Optimize Published HTML." That exact checkbox may have gone away after FP2002, or maybe eliminating dupe spaces just got built into the "Optimize" option as a non-changeable default. You can test that on your version.

People may laugh at FrontPage, and the HTML its editor now creates is problematic, but it's quick and handles a lot of files, and still works fine when you don't want to migrate. The HTML code it creates isn't remotely close to up-to-date, and nested tables can display all weird, especially in Firefox, and often in Chrome... BUT, you can migrate to the nearly identical (now free, and old and unsupported) Microsoft Expression Web 4. Then you can choose your HTML standard, including XHTML-transitional or HTML5 (the former works better). But in so doing, you lose a LOT of what you had in FrontPage for reports, drag-and-drop, and a bunch of other stuff. You gain non-editable regions that can be finicky, but functional, and you end up with less overhead for cleaner uploads.

Summary: Don't bother too much with trying to do this within FrontPage. Do it from an editor that can handle it quickly, then run your recalc. Should be OK, other than the fact that the visual appearance of sites in the lower half (WYSIWYG part) of the FP editor often depends on multiple spaces for showing layout, but then... hey, 2003 was 15 years ago now. :-)

Best to you.

Upvotes: 0

Bert te Velde

Reputation: 853

I'm not familiar with FrontPage and Notepad++ and the regex engines you may/must use in their contexts, so I'll confine myself to a few general remarks.

To find matches (two spaces or more) in text, but not within the html tags (i.e. between < and >), you may use a regex pattern like:

<.*?>|(?<spaces>\s{2,})

If there is an issue with the {n,} specifier in your regex engine, you can replace \s{2,} by \s\s+

Furthermore, if < and/or > are special (meta)characters in your regex engine, you'll need to escape them. (Again, I'm not familiar with the FrontPage and Notepad++ environment.)
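To show how the "match the tag first, otherwise capture the spaces" idea plays out in practice, here is a small sketch in Python. Two things differ from the pattern above and are my assumptions: Python spells a named group `(?P<spaces>...)`, and the helper name `collapse_outside_tags` is made up for the example. It is a regex trick, not an HTML parser, so a > inside an attribute value or a script would confuse it.

```python
import re

# Tags are matched first and kept; the named group only fires outside tags.
TAG_OR_SPACES = re.compile(r"<.*?>|(?P<spaces>\s{2,})", re.DOTALL)


def collapse_outside_tags(html: str) -> str:
    """Collapse runs of 2+ whitespace chars to one space, leaving tags as-is."""
    def fix(match: re.Match) -> str:
        if match.group("spaces") is None:
            # The tag alternative matched: return it unchanged.
            return match.group(0)
        return " "

    return TAG_OR_SPACES.sub(fix, html)


# The double space inside the tag survives; the run in the text is collapsed.
print(collapse_outside_tags('<a  href="x">two   words</a>'))
```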

Upvotes: 0

Sly

Reputation: 1175

You can use the regex / {2,}/ to match 2 or more spaces. Not sure how regular expressions work in FrontPage, since I don't use it, so I can't really give any more detail than that.

Upvotes: 0
