Reputation: 728
I have to read an XML file that has no root element and extract the data it contains. The XML has many elements like these:
<DocumentElement>
<LOG_x0020_ParityRate>
<DATE>12/09/2017 - 00:00</DATE>
<CHANNELNAME>ParityRate</CHANNELNAME>
<SQL>update THROOMDISP set ID_HOTEL = '104', ID_ROOM = '920', NUM = '3', MYDATA = '20171006' where id_hotel =104 and id_room ='920' and MYDATA ='20171006'</SQL>
<ID_HOTEL>104</ID_HOTEL>
<TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
</LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>
<LOG_x0020_ParityRate>
<DATE>12/09/2017 - 00:00</DATE>
<CHANNELNAME>ParityRate</CHANNELNAME>
<SQL>update THROOMDISP set ID_HOTEL = '105', ID_ROOM = '923', NUM = '1', MYDATA = '20171006' where id_hotel =105 and id_room ='923' and MYDATA ='20171006'</SQL>
<ID_HOTEL>105</ID_HOTEL>
<TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
</LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>
<LOG_x0020_ParityRate>
<DATE>12/09/2017 - 00:00</DATE>
<CHANNELNAME>ParityRate</CHANNELNAME>
<SQL>update THROOMDISP set ID_HOTEL = '104', ID_ROOM = '920', NUM = '3', MYDATA = '20171007' where id_hotel =104 and id_room ='920' and MYDATA ='20171007'</SQL>
<ID_HOTEL>104</ID_HOTEL>
<TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
</LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>
I tried to read it as a string, manually add opening and closing root tags, and parse it as an XDocument, but it also contains some badly formed tags, like these:
</DocumentElement>
<TYPEREQUEST>updateTHROOMPRICE(OK)</TYPEREQUEST>
These tags don't match any opening tag, so when I call XDocument.Parse on the resulting string I get exceptions. The file has millions of rows, so I can't read it line by line, or the iteration will take hours. How can I get rid of all these malformed tags and parse the document?
Upvotes: 2
Views: 2319
Reputation: 8637
You can try to use XmlParser:
A Roslyn-inspired full-fidelity XML parser with no dependencies and a simple Visual Studio XML language service.
It can parse malformed XML.
Upvotes: 0
Reputation: 728
I found a way to solve my problem: I gave up reading the file as XML and read it with a StreamReader instead, looking for the text I want, so I don't have to fight the XML format.
using (StreamReader strReader = File.OpenText(path))
{
    while (!strReader.EndOfStream)
    {
        string line = strReader.ReadLine();
        if (line.Contains("<LOG_x0020_ParityRate>"))
        {
            line = strReader.ReadLine();
            string data_ = getTagText(line);
            string channelName_ = getTagText(strReader.ReadLine());
            string sql_ = getTagText(strReader.ReadLine());
            string idHotel_ = getTagText(strReader.ReadLine());
            string type_ = getTagText(strReader.ReadLine());
        }
    }
}
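The getTagText helper is not shown in the snippet above. A minimal sketch of what it might look like, assuming each value sits on a single line in the form <TAG>text</TAG> (the class name TagTextSketch and the empty-string fallback are my assumptions, not the original code):

```csharp
using System;

public static class TagTextSketch
{
    // Hypothetical stand-in for the getTagText helper used above.
    // Returns the text between the first '>' and the last "</" on the line,
    // or an empty string when the line is not of the form <TAG>text</TAG>.
    // Entities such as &amp; are NOT decoded here.
    public static string getTagText(string line)
    {
        if (line == null)
        {
            return string.Empty;
        }

        int start = line.IndexOf('>');
        int end = line.LastIndexOf("</", StringComparison.Ordinal);
        if (start < 0 || end <= start)
        {
            return string.Empty;
        }

        return line.Substring(start + 1, end - start - 1).Trim();
    }
}
```

A stray line such as </DocumentElement> yields an empty string rather than an exception, so the malformed tags mentioned in the question do not stop the loop. If the values can contain escaped characters, passing the result through System.Net.WebUtility.HtmlDecode is one option for decoding them.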
Upvotes: 1
Reputation: 34421
Your XML is simply not well formed, which often happens when XML data is merged together. It has multiple tags at root level, so use an XmlReader like below:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication4
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            // Fragment conformance lets the reader accept several root-level elements.
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.ConformanceLevel = ConformanceLevel.Fragment;

            using (XmlReader reader = XmlReader.Create(FILENAME, settings))
            {
                while (!reader.EOF)
                {
                    try
                    {
                        if (reader.Name != "LOG_x0020_ParityRate")
                        {
                            reader.ReadToFollowing("LOG_x0020_ParityRate");
                        }
                        if (!reader.EOF)
                        {
                            XElement parityRate = (XElement)XElement.ReadFrom(reader);
                            ParityRate newLog = new ParityRate();
                            // "HH" is the 24-hour clock; "hh" would reject hours 13-23.
                            // Swap to "dd/MM/yyyy" if the log writes the day first.
                            newLog.date = DateTime.ParseExact((string)parityRate.Element("DATE"), "MM/dd/yyyy - HH:mm", CultureInfo.InvariantCulture);
                            newLog.name = (string)parityRate.Element("CHANNELNAME");
                            newLog.sql = (string)parityRate.Element("SQL");
                            newLog.hotel = (int)parityRate.Element("ID_HOTEL");
                            // Add the record only once every field has parsed.
                            ParityRate.logs.Add(newLog);
                        }
                    }
                    catch (FormatException)
                    {
                        // A field did not parse; skip this record. The reader has already moved past it.
                    }
                    catch (XmlException ex)
                    {
                        // The reader cannot continue after a parse error,
                        // so report it and stop instead of looping forever.
                        Console.WriteLine(ex.Message);
                        break;
                    }
                }
            }
        }
    }

    public class ParityRate
    {
        public static List<ParityRate> logs = new List<ParityRate>();
        public DateTime date { get; set; }
        public string name { get; set; }
        public string sql { get; set; }
        public int hotel { get; set; }
    }
}
Upvotes: 2