Reputation: 2082
I have looked up and tried several ways to convert an XML file to a CSV file:
XSLT: Write an XSLT for the given XML and then produce the CSV. This is too difficult to maintain because we don't know what kind of XML file we are going to get, so it is not a generic solution.
Digester
SAXP and JAXP
The problem with the above two approaches is that they require defining your Java objects up front. That is again a bottleneck: we would have to create many classes without knowing what structure the XML will have, and it changes every time.
DocumentBuilderFactory: Parse the document and walk through it. This works for generic XML files, but it is slow for files in the range of 5MB to 1GB. My XML files will not be larger than 1GB.
Apart from these approaches, which I have already tried, are there any ideas for how I can do this programmatically and faster? I have looked at several online tools that convert any XML file to CSV in very little time, and they seem to work for any generic XML file. Any suggestions?
Here are some examples of the input we might receive; the structure may change as well:
<?xml version="1.0"?>
<Company>
  <Employee id="1">
    <Email>[email protected]</Email>
    <artist>Bob Dylan</artist>
    <country>USA</country>
  </Employee>
</Company>
This is the simplest one. Expected output is:
Company/Employee/Email,Company/Employee/artist,Company/Employee/country,Company/Employee/_id
[email protected],Bob Dylan,USA,1
Another example
<?xml version="1.0"?>
<Company>
  <Employee id="1">
    <Email>[email protected]</Email>
    <UserData id="id32" type="AttributesInContext">
      <UserValue value="7in" title="Height"></UserValue>
      <UserValue value="" title="Weight"></UserValue>
    </UserData>
  </Employee>
  <Employee id="2">
    <Email>[email protected]</Email>
    <UserData id="id33" type="AttributesInContext">
      <UserValue value="6in" title="Height"></UserValue>
      <UserValue value="" title="Weight"></UserValue>
    </UserData>
  </Employee>
  <Employee id="3">
    <Email>[email protected]</Email>
    <UserData id="id34" type="AttributesInContext">
      <UserValue value="4in" title="Height"></UserValue>
      <UserValue value="" title="Weight"></UserValue>
    </UserData>
  </Employee>
</Company>
Expected output is
Email,UserData/UserValue/0/_value,UserData/UserValue/0/_title,UserData/UserValue/1/_value,UserData/UserValue/1/_title,UserData/_id,UserData/_type,_id
[email protected],7in,Height,,Weight,id32,AttributesInContext,1
[email protected],6in,Height,,Weight,id33,AttributesInContext,2
[email protected],4in,Height,,Weight,id34,AttributesInContext,3
This one is a bit more complex. The input can get more complex and more deeply nested than this, and files can be up to 1GB in size.
Reputation: 2937
You can try using the Java StAX API for this purpose.
For example:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class XmlToCSV {

    public static void convert(InputStream xml, OutputStream csv) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Deliver adjacent text as a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader xmlEventReader = factory.createXMLEventReader(xml);
        try (Writer csvWriter = new OutputStreamWriter(csv, StandardCharsets.UTF_8)) {
            // Column names are collected from the first record only, then header is set to null.
            StringBuilder header = new StringBuilder();
            StringBuilder line = null;
            String elementName = null;
            long nestingLevel = -1;
            while (xmlEventReader.hasNext()) {
                XMLEvent xmlEvent = xmlEventReader.nextEvent();
                switch (xmlEvent.getEventType()) {
                case XMLEvent.START_ELEMENT:
                    ++nestingLevel;
                    if (0 == nestingLevel) {
                        break; // the root element only wraps the records
                    } else if (1 == nestingLevel) {
                        line = new StringBuilder(); // each child of the root starts a new row
                    }
                    StartElement startElement = xmlEvent.asStartElement();
                    elementName = startElement.getName().getLocalPart();
                    serializeElementHeader(header, line, startElement);
                    break;
                case XMLEvent.CHARACTERS:
                case XMLEvent.CDATA:
                    if (nestingLevel < 1)
                        break;
                    // Elements without text produce no event here, so they add no field to the row.
                    Characters chars = xmlEvent.asCharacters();
                    if (!chars.isWhiteSpace()) {
                        if (header != null) {
                            header.append(elementName).append(',');
                        }
                        line.append(chars.getData()).append(',');
                    }
                    break;
                case XMLEvent.END_ELEMENT:
                    if (--nestingLevel == 0) {
                        // A record is complete: write it out instead of buffering the whole file.
                        if (header != null) {
                            writeRow(csvWriter, header);
                            header = null;
                        }
                        writeRow(csvWriter, line);
                    }
                    break;
                default:
                    break;
                }
            }
        }
    }

    // Appends one column per attribute; attributes are named <element>/_<attribute>.
    private static void serializeElementHeader(StringBuilder header, StringBuilder line,
            StartElement startElement) {
        String name = startElement.getName().getLocalPart();
        Iterator<Attribute> it = startElement.getAttributes();
        while (it.hasNext()) {
            Attribute attr = it.next();
            if (header != null) {
                header.append(name).append("/_").append(attr.getName().getLocalPart()).append(',');
            }
            line.append(attr.getValue()).append(',');
        }
    }

    // Every value is followed by a comma; drop only the last one so empty trailing values survive.
    private static void writeRow(Writer out, StringBuilder row) throws IOException {
        if (row.length() > 0) {
            row.setLength(row.length() - 1);
        }
        out.write(row.toString());
        out.write('\n');
    }

    private static final String TEST_XML = "<?xml version='1.0'?>"
            + "<Company>"
            + " <Employee id='1'>"
            + "  <Email>[email protected]</Email>"
            + "  <UserData id='id32' type='AttributesInContext'>"
            + "   <UserValue value='7in' title='Height'></UserValue>"
            + "   <UserValue value='' title='Weight'></UserValue>"
            + "  </UserData>"
            + " </Employee>"
            + " <Employee id='2'>"
            + "  <Email>[email protected]</Email>"
            + "  <UserData id='id33' type='AttributesInContext'>"
            + "   <UserValue value='6in' title='Height'></UserValue>"
            + "   <UserValue value='' title='Weight'></UserValue>"
            + "  </UserData>"
            + " </Employee>"
            + " <Employee id='3'>"
            + "  <Email>[email protected]</Email>"
            + "  <UserData id='id34' type='AttributesInContext'>"
            + "   <UserValue value='4in' title='Height'></UserValue>"
            + "   <UserValue value='' title='Weight'></UserValue>"
            + "  </UserData>"
            + " </Employee>"
            + "</Company>";

    public static void main(String[] args) throws Exception {
        try (InputStream in = new ByteArrayInputStream(TEST_XML.getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream out = new ByteArrayOutputStream(4096)) {
            convert(in, out);
            System.out.print(out.toString(StandardCharsets.UTF_8.name()));
        }
    }
}
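The code above emits values in document order; it does not produce the indexed column names from the question (for example UserData/UserValue/0/_value). One way that naming might be approached is sketched below. It is only a sketch under several assumptions: the class RecordFlattener and its methods forEachRecord, flatten, join and csv are hypothetical names invented for this illustration, column paths are taken relative to the record element as in the second example, each record (child of the root) is assumed small enough to hold in memory on its own, and the header is taken from the first record only. If records can differ in shape, the full set of columns would have to be collected in a separate pass.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of the XmlToCSV class above.
class RecordFlattener {

    // One element of the record currently being read.
    static final class Node {
        final String name;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        Node(String name) { this.name = name; }
    }

    // Streams the document and hands each flattened child of the root to the sink.
    static void forEachRecord(InputStream xml, Consumer<Map<String, String>> sink) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Deque<Node> stack = new ArrayDeque<>();
        int depth = 0;
        while (r.hasNext()) {
            switch (r.next()) {
            case XMLStreamConstants.START_ELEMENT:
                if (depth++ > 0) { // depth 0 is the root wrapper, which is skipped
                    Node n = new Node(r.getLocalName());
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        n.attrs.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    }
                    if (!stack.isEmpty()) stack.peek().children.add(n);
                    stack.push(n);
                }
                break;
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
                if (!stack.isEmpty()) stack.peek().text.append(r.getText());
                break;
            case XMLStreamConstants.END_ELEMENT:
                if (--depth > 0) {
                    Node done = stack.pop();
                    if (stack.isEmpty()) { // a whole record has been read
                        Map<String, String> row = new LinkedHashMap<>();
                        flatten(done, "", row);
                        sink.accept(row);
                    }
                }
                break;
            }
        }
    }

    // Text first, then children, then attributes; same-named siblings get a 0-based index segment.
    static void flatten(Node node, String path, Map<String, String> out) {
        String text = node.text.toString().trim();
        boolean emptyLeaf = node.children.isEmpty() && node.attrs.isEmpty();
        if (!path.isEmpty() && (!text.isEmpty() || emptyLeaf)) out.put(path, text);
        Map<String, Integer> total = new HashMap<>();
        for (Node c : node.children) total.merge(c.name, 1, Integer::sum);
        Map<String, Integer> seen = new HashMap<>();
        for (Node c : node.children) {
            String seg = c.name;
            if (total.get(c.name) > 1) seg += "/" + (seen.merge(c.name, 1, Integer::sum) - 1);
            flatten(c, join(path, seg), out);
        }
        for (Map.Entry<String, String> a : node.attrs.entrySet()) {
            out.put(join(path, "_" + a.getKey()), a.getValue());
        }
    }

    static String join(String path, String seg) {
        return path.isEmpty() ? seg : path + "/" + seg;
    }

    // Quotes a value when it contains a comma, a double quote, or a line break.
    static String csv(String v) {
        if (v.indexOf(',') < 0 && v.indexOf('"') < 0 && v.indexOf('\n') < 0 && v.indexOf('\r') < 0) return v;
        return "\"" + v.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Company><Employee id='1'><Email>bob@example.com</Email>"
                + "<UserData id='id32'><UserValue value='7in' title='Height'/>"
                + "<UserValue value='' title='Weight'/></UserData></Employee></Company>";
        boolean[] first = {true};
        forEachRecord(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), row -> {
            if (first[0]) System.out.println(String.join(",", row.keySet())); // header from first record
            first[0] = false;
            System.out.println(row.values().stream().map(RecordFlattener::csv).collect(Collectors.joining(",")));
        });
    }
}
```

Because each record is passed to the callback as soon as its closing tag is read, memory use depends on the size of one record rather than on the size of the file. The csv helper also quotes values that contain commas, which the code above does not do.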