Reputation: 145950
Is there any reason why XML such as this :
<person>
<firstname>Joe</firstname>
<lastname>Plumber</lastname>
</person>
couldn't be compressed like this for client/server transfer?
<person>
<firstname>Joe</>
<lastname>Plumber</>
</>
It would be smaller - and slightly faster to parse.
Assuming that there are no edge conditions meaning this wouldn't work - are there any libraries to do such a thing?
This is a hard thing to google it turns out :
Your search -
</>
- did not match any documents. Suggestions:
Try different keywords.
Edit: There seems to be confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that as it stands this is NOT XML. The server and client would have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be halved.
Upvotes: 5
Views: 706
Reputation: 12187
Is there any reason why
Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".
It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is the XML remains human readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience to not have to close end tags. That is, it's more human writable than standard XML. I say "minor", because most editors will do string completion for you (e.g. ^n and ^p in vim).
To strip the close tags: simplest is to use something like this: s_</[a-zA-Z0-9_$]+>_</>_
(that's not the right QName regex, but you get the idea).
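As a rough illustration of the substitution above, here is a minimal Python sketch. The function name `strip_close_names` and the name pattern are my own assumptions; the pattern only approximates XML names, and it would also rewrite anything that looks like a close tag inside comments or CDATA sections.

```python
import re

# Approximation of an XML name, not the full QName grammar.
CLOSE_TAG = re.compile(r"</[A-Za-z_][\w:.\-]*\s*>")

def strip_close_names(xml_text: str) -> str:
    """Replace every named close tag with a bare </>."""
    return CLOSE_TAG.sub("</>", xml_text)

print(strip_close_names("<person><firstname>Joe</firstname></person>"))
# prints: <person><firstname>Joe</></>
```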
To add them back: you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tag names.
have a stack.
scan the XML, and output it as-is.
if you recognize an open tag, push its name.
if you recognize a close tag, pop to get its name, and insert that in the output (you can do this even when there is a proper close tag).
BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag. Same as nested parentheses.
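The stack procedure could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: `restore_close_names` and the `TAG` pattern are hypothetical names, and the regex is a simplification that does not handle comments, CDATA, or a `>` inside an attribute value. It also closes whatever is still on the stack at end of input.

```python
import re

# Groups: leading "/", tag name (absent for </>), attribute text, trailing "/".
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)?([^<>]*?)(/?)>")

def restore_close_names(text: str) -> str:
    stack = []  # names of the currently open elements, innermost last

    def fix(match):
        closing, name, _attrs, self_closing = match.groups()
        if closing:
            # A close tag always belongs to the most recent open tag,
            # so the popped name is correct for </> and for </name> alike.
            return f"</{stack.pop()}>"
        if name and not self_closing:
            stack.append(name)
        # Open tags, self-closing tags and anything else pass through as-is.
        return match.group(0)

    out = TAG.sub(fix, text)
    # At end of input, close anything still open, innermost first.
    return out + "".join(f"</{name}>" for name in reversed(stack))

print(restore_close_names("<person><firstname>Joe</><lastname>Plumber</></>"))
# prints: <person><firstname>Joe</firstname><lastname>Plumber</lastname></person>
```

A stray close tag with nothing open raises `IndexError` here; a real implementation would want a clearer error.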
However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?
EDIT: You can further omit trailing </>, so your example becomes (when the parser sees EOF, it adds close tags for whatever's left on the stack):
<person>
<firstname>Joe</>
<lastname>Plumber
Upvotes: 4
Reputation: 18507
What you are describing is SGML, which uses </> to end the nearest previous nonempty tag.
Upvotes: 3
Reputation: 4703
Do not bother with in-text optimizations of your XML that degrade read/write performance and simplicity. Use deflate compression to compress your payload between the client and the server. I ran some tests: compressing a normal 10k XML file results in a 2.5k blob. Removing all end tag names lowers the original file size to 9k, but once deflated it's again 2.5k. This is a very good example of why dictionary-based compression is the simple way to compress payloads between endpoints. A named close tag such as "</lastname>" and a bare "</>" will use (almost) the same space in the compressed data.
The only exception would be very small files or payloads, which are less compressible.
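For anyone who wants to repeat this kind of measurement, here is a small sketch using Python's `zlib` (deflate). The sample document is synthetic and far more repetitive than a real file, so the absolute numbers will not match the 10k/2.5k figures above; only the comparison between the two variants is of interest.

```python
import re
import zlib

# Synthetic sample data (an assumption for illustration, not the file tested above).
records = "".join(
    f"<person><firstname>Joe{i}</firstname><lastname>Plumber{i}</lastname></person>\n"
    for i in range(200)
)
original = f"<people>\n{records}</people>\n".encode("utf-8")

# Same document with the names removed from all close tags.
stripped = re.sub(rb"</[A-Za-z_][\w:.\-]*>", b"</>", original)

for label, data in (("named close tags", original), ("bare </>", stripped)):
    packed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} bytes raw, {len(packed)} bytes deflated")
```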
Upvotes: 0
Reputation: 12488
If not using gzip or anything like that, I'd simply replace each tag with a shorter tag name before sending, and restore the original names before using the XML on the receiving end. Thus you'd get something like this:
<a>
<b>Joe</b>
<c>Plumber</c>
</a>
This makes it very easy to use any standard parser to iterate through all nodes and replace the node names accordingly.
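A minimal sketch of that renaming with Python's standard `xml.etree.ElementTree`. The `SHORT` mapping and the `rename_tags` helper are illustrative assumptions; both ends would have to agree on the same table.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping shared by client and server.
SHORT = {"person": "a", "firstname": "b", "lastname": "c"}
LONG = {short: long for long, short in SHORT.items()}

def rename_tags(xml_text: str, mapping: dict) -> str:
    """Rename every element whose tag appears in mapping; leave the rest alone."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() visits the root and all descendants
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")

source = "<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>"
wire = rename_tags(source, SHORT)
print(wire)                     # prints: <a><b>Joe</b><c>Plumber</c></a>
print(rename_tags(wire, LONG))  # prints the original document again
```

Because the shortened form is still well-formed XML, it stays compatible with ordinary parsers and tools on both sides.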
Upvotes: 0
Reputation: 625077
That's not valid XML. Closing tags must be named. It's potentially error prone otherwise and frankly I think it'd be less readable your way.
In reference to your clarification about this being a nonstandard violation of the XML standard to save a few bytes, it is an incredibly bad idea for several reasons:
Upvotes: 5
Reputation: 300845
As you say, this isn't XML, so why make it even look like XML? You've already lost the ability to use any XML parsers or tools. I would either
Upvotes: 5
Reputation: 49311
If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.
The reasons this isn't done are:
Upvotes: 8
Reputation: 53366
Yes, XML is kind of a heavy format. But it has certain advantages.
If you think XML is too heavy for your use, have a look at JSON instead. It is lightweight but has less functionality than XML.
And if you want really small files, use a binary format ;-).
Upvotes: 0
Reputation: 75794
Even if this were possible it could only take longer to parse because now the parser has to work out what's being closed and will have to keep checking if that's correct.
If you want compression, XML is highly gzip'able.
Upvotes: 2
Reputation: 993085
You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:
<p/This paragraph contains a <em/bold/ word./
Fortunately, the designers of XML chose to omit this particular chapter of madness.
Upvotes: 1
Reputation: 64632
If you need better compression and easier parsing, you may try using XML attributes:
<person firstname="Joe" lastname="Plumber" />
Upvotes: 5