kevin

Reputation: 14065

xml and & issue

I am new to XML and am trying to read an XML file. I googled and tried the approach below, but I get this error:

Reference to undeclared entity 'Ccaron'. Line 2902, position 9.

When I go to line 2902, I find this:

<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>

This is what I tried:

XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
    if (n.SelectSingleNode("H0742") != null)
    {
        row.IrNbr = n.SelectSingleNode("H0742").InnerText;
    }
    .
    .
    .
}

According to w3schools, & is illegal in XML.

EDIT: This is the file's encoding. I wonder whether it is related to the problem somehow.

encoding='iso-8859-1'

Thanks in advance.

EDIT :

They gave me an .ENT file, which I can also reference online at ftp.MyPartnerCompany.com/name.ent. In this .ENT file I see entities like this:

<!ENTITY Cacute "&#262;"> <!-- latin capital letter C with acute,
                                  U+0106 Latin Extended-A -->

How can I reference it when parsing my XML? I would prefer to reference the online copy, since they may add new entities at any time. Thanks in advance!

Upvotes: 2

Views: 1094

Answers (5)

Nic Gibson

Reputation: 7143

The first thing to be aware of is that the problem isn't in your software.

As you are new to XML, I'm going to guess that defining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper- and lower-case C with caron, in case your font can't display it).

However, XML has only a few predeclared entities (ampersand, less than, greater than, double quote and apostrophe). Any other character entity needs to be declared. To parse your file correctly you will need to do one of two things: either replace the entity reference with something that doesn't trip up the parser, or declare the entity.

To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:

<!DOCTYPE root-element 
   [ <!ENTITY Ccaron "&#x010C;">
     <!ENTITY ccaron "&#x010D;">]
>

Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.

Alternatively, simply change the &Ccaron; to &#x010C; and your problem will also be resolved.

The &#...; notation is a numeric character reference, giving the Unicode code point of the character (the 'x' indicates that it's in hex).

You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.
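For the internal-subset route, here is a minimal C# sketch of how the declaration and the parser settings fit together. The class and method names (EntityDemo, LoadWithDtd) and the sample document are made up for illustration. It assumes you can prepend the DOCTYPE to the XML text, and it turns DTD processing on explicitly because readers created with XmlReader.Create prohibit DTDs by default.

```csharp
using System;
using System.IO;
using System.Xml;

static class EntityDemo
{
    // Parses XML text with DTD processing enabled so that entities
    // declared in the internal subset are expanded while loading.
    public static XmlDocument LoadWithDtd(string xml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // XmlReader.Create prohibits DTDs by default
            XmlResolver = null                   // internal subset only, nothing is fetched
        };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }

    static void Main()
    {
        // Hypothetical sample: the DOCTYPE name must match the root element.
        string xml =
            "<!DOCTYPE root [ <!ENTITY Ccaron \"&#x010C;\"> ]>" +
            "<root><H0742>&Ccaron;opova 14, POB 1725, SI-1000 Ljubljana</H0742></root>";

        XmlDocument doc = LoadWithDtd(xml);
        Console.WriteLine(doc.SelectSingleNode("/root/H0742").InnerText);
    }
}
```

If the real file starts with an XML declaration, the DOCTYPE has to come after that declaration rather than before it.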

Upvotes: 3

Mads Hansen

Reputation: 66714

&Ccaron; is an entity reference. It is likely that the entity reference is intended to be for the character Č, in order to produce: Čopova.

However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.
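Since the question mentions a partner-supplied .ENT file, one way to declare every entity at once is to pull that file into the internal subset through an external parameter entity. The sketch below is only an illustration: ExternalEntityDemo, ReadWithEntityFile and the temporary file standing in for the partner's file are all hypothetical. Whether XmlUrlResolver can fetch the actual partner URL depends on its scheme and on your runtime, which is not verified here. Resolving external entities also means trusting the server that hosts them.

```csharp
using System;
using System.IO;
using System.Xml;

static class ExternalEntityDemo
{
    // Wraps an XML fragment in a document whose internal subset imports
    // all declarations from the entity file at entityFileUri.
    public static string ReadWithEntityFile(string entityFileUri, string body)
    {
        string xml =
            "<!DOCTYPE root [ <!ENTITY % ents SYSTEM \"" + entityFileUri + "\"> %ents; ]>" +
            "<root>" + body + "</root>";

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // allow the DOCTYPE to be processed
            XmlResolver = new XmlUrlResolver()   // needed to load the external file
        };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc.DocumentElement.InnerText;
        }
    }

    static void Main()
    {
        // A local temp file stands in for the partner's .ENT file (assumption).
        string entFile = Path.GetTempFileName();
        File.WriteAllText(entFile, "<!ENTITY Ccaron \"&#268;\">");
        try
        {
            string uri = new Uri(entFile).AbsoluteUri;
            string text = ReadWithEntityFile(uri, "<H0742>&Ccaron;opova 14</H0742>");
            Console.WriteLine(text);
        }
        finally
        {
            File.Delete(entFile);
        }
    }
}
```

Because the entity file is read each time the document is parsed, entities added to it later are picked up without changing the code.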

Upvotes: 1

Rubens Farias

Reputation: 57936

Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.

You have two options:

  • Open that file as a regular text file and fix that symptom.
  • Fix your XML generator, which is your real problem. That generator probably isn't producing the file with System.Xml but by concatenating strings, on the assumption that "XML is just a text file". You should repair it, or opening a generated XML file will always be a surprise.

EDIT: As you can't fix your XML generator, I recommend opening the file with File.ReadAllText and running a regular expression to re-encode that & or to strip the entire entity (as we can't translate it):

using System;
using System.Text.RegularExpressions;

// The lookahead skips numeric references such as &#123; so they are left intact.
Console.WriteLine(
    Regex.Replace("<H0742>&Ccaron;opova 14, &#123; POB & SI-1000 &amp;</H0742>",
    @"&(?!#\d+;|#x[0-9A-Fa-f]+;)(\w+;)?", match =>
    {
        switch (match.Value)
        {
            case "&lt;":
            case "&gt;":
            case "&amp;":
            case "&quot;":
            case "&apos;":
                return match.Value; // correctly encoded

            case "&":
                return "&amp;"; // bare ampersand

            default: // unknown named entity; here you can choose:
                // to remove the entire entity:
                return "";
                // or to encode just the & character instead:
                // return "&amp;" + match.Value.Substring(1);
        }
    }));

Upvotes: 1

ratneshsinghparihar

Reputation: 301

Solution:

// Encode the XML string as UTF-8 bytes
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream; a new MemoryStream already starts at position 0
MemoryStream ms = new MemoryStream(encodedString);
// Build the XmlDocument from the MemoryStream of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);

Upvotes: 0

John Leidegren

Reputation: 60987

&Ccaron; isn't XML; it's not even defined in the HTML 4 entity reference (which, by the way, isn't XML either). XML doesn't support all those entities; in fact, it predefines very few of them. But if you look the entity up, you can use its Unicode equivalent as a numeric character reference instead. For example, &Scaron; is invalid XML but &#352; isn't. (Scaron was the closest I could find to Ccaron.)
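As a quick check of the numeric form: Ccaron is U+010C, which is 268 in decimal, so &#268; should load without any DTD. The sketch below uses hypothetical names (NumericReferenceDemo, ReadText) and a cut-down version of the element from the question.

```csharp
using System;
using System.Xml;

static class NumericReferenceDemo
{
    // Loads an XML string and returns the text of its root element.
    public static string ReadText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        // &#268; is the decimal character reference for U+010C (C with caron).
        string text = ReadText("<H0742>&#268;opova 14</H0742>");

        // Compare against the character itself, written as a C# escape.
        Console.WriteLine(text == "\u010C" + "opova 14"); // prints True
    }
}
```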

Upvotes: 2
