Simon_Weaver

Reputation: 145950

Can Xml be compressed with </> to end elements?

Is there any reason why XML such as this :

<person>    
    <firstname>Joe</firstname>    
    <lastname>Plumber</lastname>
</person>

couldn't be compressed like this for client/server transfer?

<person>    
    <firstname>Joe</>    
    <lastname>Plumber</>
</>

It would be smaller - and slightly faster to parse.

Assuming that there are no edge conditions meaning this wouldn't work - are there any libraries to do such a thing?

This is a hard thing to google it turns out :

Your search - </> - did not match any documents.

Suggestions:

Try different keywords.

Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that as it stands this is NOT XML. The server and client would have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be halved.

Upvotes: 5

Views: 706

Answers (14)

13ren

Reputation: 12187

Is there any reason why

Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".

It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is the XML remains human readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience to not have to close end tags. That is, it's more human writable than standard XML. I say "minor", because most editors will do string completion for you (e.g. ^n and ^p in vim).

To strip the close tags: simplest is to use something like this: s_</[a-zA-Z0-9_$]+>_</>_ (that's not the right QName regex, but you get the idea).
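A minimal Python sketch of that substitution, assuming the same rough tag-name pattern as the expression above (not the full QName grammar). The helper name `strip_close_names` is made up for illustration, and the sketch would also rewrite anything that looks like a close tag inside comments or CDATA sections.

```python
import re

# Rough stand-in for an XML name, as in the substitution above;
# it is not the full QName grammar.
CLOSE_TAG = re.compile(r"</[a-zA-Z0-9_$]+>")


def strip_close_names(xml_text: str) -> str:
    """Replace every named close tag with the anonymous form </>."""
    return CLOSE_TAG.sub("</>", xml_text)


if __name__ == "__main__":
    sample = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    print(strip_close_names(sample))
```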

To add them back: you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tag names.

have a stack.
scan the XML, and output it, as-is.
if you recognize an open tag, push its name.
if you recognize close tag, pop to get its name, and
  insert that in the output (you can do this even when there is a proper close tag).
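The stack steps above could look roughly like this in Python. This is a sketch under stated assumptions, not a real parser: the tokenizer is a regular expression, it assumes attribute values never contain `>`, and it does not understand CDATA sections. The name `restore_close_names` is hypothetical.

```python
import re

# Alternatives, in order: comments, processing instructions,
# close tags (named or anonymous), open or self-closing tags.
TOKEN = re.compile(
    r"<!--.*?-->"
    r"|<\?.*?\?>"
    r"|</(?P<close>[^\s>]*)\s*>"
    r"|<(?P<open>[^\s/>!?]+)(?P<attrs>[^>]*?)(?P<empty>/?)>",
    re.DOTALL,
)


def restore_close_names(text: str) -> str:
    """Copy the input as-is, but give every close tag the name on top of the stack."""
    stack = []

    def fix(match):
        if match.group("open") is not None:
            if not match.group("empty"):       # <br/> opens nothing
                stack.append(match.group("open"))
            return match.group(0)
        if match.group("close") is not None:   # "" for </>, a name for </name>
            if not stack:
                raise ValueError("close tag without a matching open tag")
            return "</" + stack.pop() + ">"
        return match.group(0)                  # comment or processing instruction

    return TOKEN.sub(fix, text)


if __name__ == "__main__":
    print(restore_close_names("<person><firstname>Joe</><lastname>Plumber</></>"))
```

Because the name is always taken from the stack, input that already has proper close tags passes through unchanged as long as it is well formed.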

BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag. Same as nested parentheses.

However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?

EDIT: You can further omit trailing </>, so your example becomes (when the parser sees EOF, it adds close tags for whatever's left on the stack):

<person>    
    <firstname>Joe</>    
    <lastname>Plumber
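A rough Python sketch of that end-of-input rule, assuming a deliberately simplified tag matcher (no comments, CDATA, processing instructions, or self-closing tags). The function name `restore_with_eof` is made up for illustration.

```python
import re

# Simplified matcher: a close tag (named or anonymous) or an open tag.
TAG = re.compile(r"</(?P<close>[^\s>]*)>|<(?P<open>[^\s/>]+)[^>]*>")


def restore_with_eof(text: str) -> str:
    """Name every close tag from the stack, then close what is left at EOF."""
    stack = []
    out = []
    pos = 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])    # text between tags, unchanged
        if match.group("open") is not None:
            stack.append(match.group("open"))
            out.append(match.group(0))
        else:
            out.append("</" + stack.pop() + ">")
        pos = match.end()
    out.append(text[pos:])
    while stack:                               # EOF: close everything still open
        out.append("</" + stack.pop() + ">")
    return "".join(out)


if __name__ == "__main__":
    print(restore_with_eof("<person><firstname>Joe</><lastname>Plumber"))
```

One consequence of this design is that a truncated transfer becomes indistinguishable from a complete one, since missing close tags are always filled in silently.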

Upvotes: 4

dalle

Reputation: 18507

What you are describing is SGML, which uses </> to end the nearest previous nonempty tag.

Upvotes: 3

Martin Plante

Reputation: 4703

Do not bother with in-text optimizations of your XML that degrade read/write performance and simplicity. Use deflate compression to compress your payload between the client and the server. I ran some tests, and compressing a normal 10k XML file results in a 2.5k blob. Removing all end tag names lowers the original file size to 9k, but once deflated it's again 2.5k. This is a very good example of why dictionary-based compression is the simple way to compress payloads between endpoints. "</firstname>" and "</>" will use (almost) the same space in the compressed data.

The only exception would be very small files or payloads, which are less compressible.
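A comparison of this kind can be reproduced with a short Python sketch. The sample document is generated and entirely made up, so the numbers will not match the 10k/2.5k figures above; `zlib.compress` is used here, which wraps the deflate stream in a small zlib header and checksum.

```python
import re
import zlib


def make_sample(records: int = 200) -> str:
    """Build a repetitive XML document; the content is invented for illustration."""
    rows = [
        f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>"
        for i in range(records)
    ]
    return "<people>" + "".join(rows) + "</people>"


def sizes(xml_text: str) -> dict:
    """Byte counts for the full and stripped documents, raw and deflated."""
    stripped = re.sub(r"</[A-Za-z0-9_]+>", "</>", xml_text)
    full_bytes = xml_text.encode("utf-8")
    stripped_bytes = stripped.encode("utf-8")
    return {
        "full": len(full_bytes),
        "stripped": len(stripped_bytes),
        "full_deflated": len(zlib.compress(full_bytes)),
        "stripped_deflated": len(zlib.compress(stripped_bytes)),
    }


if __name__ == "__main__":
    for name, size in sizes(make_sample()).items():
        print(f"{name}: {size} bytes")
```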

Upvotes: 0

Svante Svenson

Reputation: 12488

If not using gzip or anything like that, I'd simply replace each tag with a shorter tag name before sending, and restore the original names before using the XML on the receiving end. Thus you'd get something like this:

<a>
    <b>Joe</b>
    <c>Plumber</c>
</a>

This makes it very easy to use any standard parser to iterate through all nodes and replace the node names accordingly.
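As a sketch of that idea, here is one way it might look with Python's standard `xml.etree.ElementTree`. The mapping tables `SHORT` and `LONG` and the helper `rename_tags` are hypothetical; client and server would have to agree on the mapping in advance.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping shared by client and server.
SHORT = {"person": "a", "firstname": "b", "lastname": "c"}
LONG = {short: long for long, short in SHORT.items()}


def rename_tags(xml_text: str, mapping: dict) -> str:
    """Rename every element whose tag appears in the mapping; leave others alone."""
    root = ET.fromstring(xml_text)
    for element in root.iter():                # includes the root itself
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
    short = rename_tags(original, SHORT)       # what goes on the wire
    print(short)
    print(rename_tags(short, LONG))            # restored on the receiving end
```

Both directions stay well-formed XML, so no custom parser is needed on either side.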

Upvotes: 0

cletus

Reputation: 625077

That's not valid XML. Closing tags must be named. It's potentially error prone otherwise and frankly I think it'd be less readable your way.

In reference to your clarification about this being a nonstandard violation of the XML standard to save a few bytes, it is an incredibly bad idea for several reasons:

  1. It's nonstandard and possibly will have to be supported far in the future;
  2. Standards exist for a reason. Standards and conventions have a lot of power and having "custom XML" ranks up there with Ivory Tower graphic designers who force programmers to write a custom button replacement because the standard one can't do whatever weird, wonderful and confusing behaviour was dreamt up;
  3. Gzip compression is easy and far more effective and won't break standards. If you see a gzip octet stream, there's no mistaking it for XML. The real problem with the shorthand scheme you've got is that it still has <?xml at the top, so some poor unsuspecting parser may make the mistake of thinking it's valid and bomb out with a different, misleading error;
  4. Information theory: compression works by removing redundancy of information. If you do that by hand, it makes gzip compression no more effective because the same amount of information is represented;
  5. There is a significant overhead on converting documents to and from this scheme. It can't be done with a standard XML parser so you'd have to effectively write your own XML parser and outputter that understands this scheme (actually conversion to this format can be done with a parser; getting it back is more difficult), which is a lot of work (and a lot of bugs).

Upvotes: 5

Paul Dixon

Reputation: 300845

As you say, this isn't XML, so why make it even look like XML? You've already lost the ability to use any XML parsers or tools. I would either

  • Use XML, and compress it on the wire as you'll see far greater savings than with your own scheme
  • Use another more compact format like YAML or JSON

Upvotes: 5

peterchen

Reputation: 41096

If size of the data is any issue at all, XML is not for you.

Upvotes: 4

Pete Kirkham

Reputation: 49311

If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.

The reasons this isn't done are:

  • much better XML-agnostic compression schemes already exist (in terms of compression ratio, and probably in terms of CPU and space: a certain 7 N UTF-8 document would get 14% compression but require at least 2 N bytes of space to decompress, rather than the constant space required by most decompression algorithms).
  • much better XML-aware compression schemes already exist (google 'binary xml'). For schema-aware compression, the schemes based on ASN.1 give much better results than halving the size devoted to indicating element type.
  • the decompressor must parse the non-standard XML and keep a stack of the open tags it has encountered. So unless you're plugging it in instead of a parser, you have doubled the parsing cost. If you do plug it in instead of the parser, you're mixing different layers, which is liable to cause confusion at some point.

Upvotes: 8

Toon Krijthe

Reputation: 53366

Yes, XML is a rather heavy format. But it has certain advantages.

If you think XML is too heavy for your use, have a look at JSON instead. It is lightweight but has less functionality than XML.

And if you want really small files, use a binary format ;-).

Upvotes: 0

annakata

Reputation: 75794

Even if this were possible it could only take longer to parse because now the parser has to work out what's being closed and will have to keep checking if that's correct.

If you want compression, XML is highly gzip'able.

Upvotes: 2

Karan

Reputation: 1676

Is there any reason you aren't using YAML or JSON?

Upvotes: 0

Greg Hewgill

Reputation: 993085

You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:

<p/This paragraph contains a <em/bold/ word./

Fortunately, the designers of XML chose to omit this particular chapter of madness.

Upvotes: 1

Boris Pavlović

Reputation: 64632

If you need better compression and easier parsing, you may try using XML attributes:

<person firstname="Joe" lastname="Plumber" />
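For illustration, a small Python sketch that folds text-only child elements into attributes, using the standard `xml.etree.ElementTree`. The helper `children_to_attributes` is hypothetical, and the approach only works when child names are unique within the parent and the values are plain text.

```python
import xml.etree.ElementTree as ET


def children_to_attributes(xml_text: str) -> str:
    """Turn leaf children of the root into attributes on the root."""
    root = ET.fromstring(xml_text)
    for child in list(root):                   # copy, since we remove while looping
        if len(child) == 0 and not child.attrib:
            root.set(child.tag, child.text or "")
            root.remove(child)
    if len(root) == 0:
        root.text = None                       # drop leftover indentation whitespace
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    source = """<person>
    <firstname>Joe</firstname>
    <lastname>Plumber</lastname>
</person>"""
    print(children_to_attributes(source))
```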

Upvotes: 5

raupach

Reputation: 3102

Sorry, not in the spec. If you have a big XML file, you'd better compress it with zip, gzip, or similar.

Upvotes: 0
