Troubled Magento user

Reputation: 179

Parse large XML file over FTP

I need to parse a large XML file (>1 GB) which is located on an FTP server. I have an FTP stream acquired with ftp_connect(), which I also use for other FTP-related actions.

I know XMLReader is preferred for large XML files, but it only accepts a URI, so I assume a stream wrapper will be required. And the only FTP functions I know of that allow me to retrieve just a small part of the file are ftp_nb_fget() in combination with ftp_nb_continue(), sketched below.
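
What I mean by the non-blocking pattern is roughly this (an untested sketch; the remote filename is a placeholder):

// $ftp is the connection from ftp_connect() (after ftp_login()).
// Download into a local stream one chunk at a time.
$local = fopen('php://temp', 'w+');
$ret = ftp_nb_fget($ftp, $local, 'large.xml', FTP_BINARY);
while ($ret === FTP_MOREDATA) {
    // ...process whatever has arrived in $local so far...
    $ret = ftp_nb_continue($ftp);
}
if ($ret !== FTP_FINISHED) {
    die('Download failed');
}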

However, I do not know how I should put all of this together to make sure that a minimum amount of memory is used.

Upvotes: 4

Views: 2068

Answers (3)

Gordon

Reputation: 317049

Hmm, I never tried that with FTP, but setting the stream context that libxml-based loaders (including XMLReader) use can be done with libxml_set_streams_context().
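
Something along these lines (an untested sketch; host, credentials and path are placeholders):

$context = stream_context_create(array('ftp' => array()));

// Make XMLReader (and other libxml-based loaders) use this context.
libxml_set_streams_context($context);

$reader = new XMLReader();
$reader->open('ftp://user:password@example.com/path/to/large.xml');

while ($reader->read()) {
    // process nodes incrementally here
}
$reader->close();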

Then just pass the FTP URI to XMLReader::open().

EDIT: Note that you can use the stream context for other actions as well. If you are uploading files, you can probably use the same stream context in combination with file_put_contents(), so you don't necessarily need any of the ftp_* functions at all.
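
For instance (again an untested sketch; the URI is a placeholder, and the overwrite option is only needed when replacing an existing file):

$context = stream_context_create(array('ftp' => array('overwrite' => true)));

// Upload through the ftp:// wrapper using the same context.
file_put_contents(
    'ftp://user:password@example.com/path/upload.xml',
    $data,
    0,
    $context
);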

Upvotes: 0

ircmaxell

Reputation: 165201

This will depend on the schema of your XML file. But if it's something similar to RSS in that it's really just a long list of items (all wrapped in a single root tag), then what I've done is to split out the individual sections and parse them as individual DOMDocuments:

// $fp could be any readable stream, e.g. fopen('ftp://user:pass@host/file.xml', 'r')
$buffer = '';
while (($line = fgets($fp)) !== false) {
    $buffer .= $line;
    // A closing </item> means one complete record is buffered.
    if (strpos($line, '</item>') !== false) {
        parseBuffer($buffer);
        $buffer = '';
    }
}

That's still partly pseudo code (parseBuffer() is left undefined here), but it's a lightweight way of handling this specific type of XML file without building your own XMLReader-style streaming parser. You'd of course need to handle the opening tags as well, to ensure that the buffer always contains a well-formed XML fragment.

Note that this won't work with all XML schemas. But if it fits, it's an easy and clean way of doing it while keeping your memory footprint as low as possible...
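
For illustration, parseBuffer() could look something like this (a sketch that assumes each buffered chunk is exactly one well-formed <item> element; the per-item handling is a placeholder):

function parseBuffer($buffer) {
    // Each chunk is one complete <item>, small enough to load
    // into its own DOMDocument.
    $doc = new DOMDocument();
    $doc->loadXML($buffer);
    foreach ($doc->getElementsByTagName('item') as $item) {
        // Placeholder: process one record at a time.
        echo trim($item->textContent), "\n";
    }
}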

Upvotes: 0

Charles

Reputation: 51411

It looks like you may need to build on top of the low-level XML parser bits.

In particular, you can use xml_parse() to process the XML one chunk at a time, after calling the various xml_set_* functions with callbacks to handle elements, character data, namespaces, entities, and so on. Those callbacks are triggered whenever the parser has accumulated enough data, which means you can process the file as you read it in arbitrarily-sized chunks from the FTP server.


Proof of concept using the interactive CLI and xml_set_default_handler(), which gets called for anything that doesn't have a more specific handler:

php > $p = xml_parser_create('utf-8');
php > xml_set_default_handler($p, function() { print_r(func_get_args()); });
php > xml_parse($p, '<a');
php > xml_parse($p, '>');
php > xml_parse($p, 'Foo<b>Bar</b>Baz');
Array
(
    [0] => Resource id #3
    [1] => <a>
)
Array
(
    [0] => Resource id #3
    [1] => Foo
)
Array
(
    [0] => Resource id #3
    [1] => <b>
)
Array
(
    [0] => Resource id #3
    [1] => Bar
)
Array
(
    [0] => Resource id #3
    [1] => </b>
)
php > xml_parse($p, '</a>');
Array
(
    [0] => Resource id #3
    [1] => Baz
)
Array
(
    [0] => Resource id #3
    [1] => </a>
)
php >
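
To tie that back to FTP, the feed loop could look roughly like this, reusing the default handler from the transcript above (a sketch; the URI and the 8 KiB chunk size are arbitrary, and real code should also check the return value of xml_parse() for errors):

$parser = xml_parser_create('utf-8');
xml_set_default_handler($parser, function() { print_r(func_get_args()); });

// Read the remote file in small chunks and feed each one to the
// parser; handlers fire as soon as enough data has arrived, so
// memory usage stays bounded by the chunk size.
$fp = fopen('ftp://user:password@example.com/path/to/large.xml', 'r');
while (!feof($fp)) {
    xml_parse($parser, fread($fp, 8192), feof($fp));
}
fclose($fp);
xml_parser_free($parser);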

Upvotes: 0
