nish

Reputation: 7280

How to modify a large file remotely

I have a large XML file, ~30 MB.

Every now and then I need to update some of the values. I am using the ElementTree module to modify the XML. Currently I fetch the entire file, update it, and then upload it again, so there is ~60 MB of data transfer every time. Is there a way to update the file remotely? I am using the following code to update the file.

import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
root = tree.getroot()

# Map each SKU to its new quantity (replaces the parallel skus/qtys lists)
new_qtys = {"RUSSE20924": 2, "PSJAI22443": 3}

for child in root:
    # findtext() returns the element text as str, so no encode() is needed
    sku = child.findtext("Product_Code")
    if sku in new_qtys:
        print("found")
        child.find("Quantity").text = str(new_qtys[sku])
        child.set("updated", "yes")

tree.write("feed.xml")

Upvotes: 0

Views: 276

Answers (1)

ntninja

Reputation: 1325

Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.

The reason is that there are only three commands in FTP that actually modify a file (Source):

  • APPE: Appends to a file
  • STOR: Uploads a file
  • STOU: Creates a new file on the server with a unique name
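
To make the append exception concrete, here is a minimal sketch using Python's ftplib. It assumes `ftp` is an already connected and logged-in `ftplib.FTP` object; the helper names `append_to_remote` and `replace_remote` are made up for illustration.

```python
import io


def append_to_remote(ftp, remote_name, data):
    """Append raw bytes to the end of a remote file using APPE."""
    # storbinary() sends whatever command string it is given,
    # so "APPE" is used here in place of the usual "STOR".
    return ftp.storbinary("APPE " + remote_name, io.BytesIO(data))


def replace_remote(ftp, remote_name, data):
    """Replace the whole remote file using STOR (the full upload case)."""
    return ftp.storbinary("STOR " + remote_name, io.BytesIO(data))
```

Neither call can change bytes in the middle of the file, which is why an XML update still ends up as a full STOR.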

What you could do

Track changes

Cache the remote file locally and use the MDTM command, which returns the file's last modification time, to check whether the remote copy has changed before downloading it again.

Pros:

  • Will halve the required data transfer in many cases.
  • Hardly requires any change to existing code.
  • Almost zero overhead.

Cons:

  • Other clients will have to download the entire thing every time something changes
    (no change from current situation)
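
A rough sketch of the caching idea, assuming `ftp` is a connected `ftplib.FTP` object and that the server supports MDTM (it is an extension, so that is not guaranteed). The `.mdtm` sidecar file and the function names are my own invention, not part of any library.

```python
import os


def remote_mtime(ftp, remote_name):
    """Return the server's modification timestamp string for remote_name."""
    # A successful MDTM reply looks like "213 20240131120000".
    reply = ftp.sendcmd("MDTM " + remote_name)
    return reply.split(None, 1)[1].strip()


def refresh_cache(ftp, remote_name, local_path):
    """Download remote_name only if its MDTM stamp differs from the cached one.

    Returns True if a download happened, False if the local copy was reused.
    """
    stamp_path = local_path + ".mdtm"  # hypothetical sidecar holding the last stamp
    current = remote_mtime(ftp, remote_name)
    cached = None
    if os.path.exists(stamp_path) and os.path.exists(local_path):
        with open(stamp_path) as f:
            cached = f.read().strip()
    if cached == current:
        return False  # remote file unchanged, skip the download
    with open(local_path, "wb") as out:
        ftp.retrbinary("RETR " + remote_name, out.write)
    with open(stamp_path, "w") as f:
        f.write(current)
    return True
```

After uploading a modified file you would query MDTM again and store the new stamp, so that your own upload does not trigger a download on the next run.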

Split up into several files

Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need.

Pros:

  • Less data to transfer
  • Allows all scripts that access the data to only download what they need
  • Can be combined with the caching suggestion above

Cons:

  • All existing code has to be adapted
  • Additional overhead when downloading or updating all the data
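
As an illustration of the split, here is a small sketch that writes one file per product. It assumes the feed's root element contains one child per product with a Product_Code element (as in the question's code) and that the codes are safe to use as file names; `split_feed` is a hypothetical helper.

```python
import os
import xml.etree.ElementTree as ET


def split_feed(feed_path, out_dir):
    """Write each product element of the feed to its own XML file.

    Returns the list of product codes that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    codes = []
    for product in ET.parse(feed_path).getroot():
        code_el = product.find("Product_Code")
        if code_el is None or not code_el.text:
            continue  # skip entries without a usable code
        code = code_el.text.strip()
        # Assumption: the code contains no characters that are invalid in file names.
        target = os.path.join(out_dir, code + ".xml")
        ET.ElementTree(product).write(target, encoding="utf-8", xml_declaration=True)
        codes.append(code)
    return codes
```

An update then only has to download and upload the few small files for the SKUs that changed.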

Switch to a delta-sync protocol

If the storage server supports it, switching to a delta synchronization protocol like rsync would help a lot, because such protocols transmit only the changes (with little overhead).

Pros:

  • Less data transfer
  • Requires little change to existing code

Cons:

  • Might not be available

Do it remotely

You already pointed out that you can't, but it would still be the best solution.

What won't help

Switch to a network filesystem

As somebody in the comments already pointed out, switching to a network file system (like NFS or CIFS/SMB) would not really help, because you cannot change parts of a file in place unless the new data has exactly the same length as the data it replaces.
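
The length restriction can be seen with an ordinary file. The sketch below patches bytes in place and refuses to do so when the lengths differ; `overwrite_in_place` is a hypothetical helper, and for simplicity it reads the whole file to find the offset, which on a network file system would still transfer the data.

```python
def overwrite_in_place(path, old, new):
    """Overwrite the first occurrence of old with new without rewriting the file.

    Only possible when both byte strings have the same length; otherwise
    everything after the edit would have to be shifted and written again.
    """
    if len(old) != len(new):
        raise ValueError("in-place patch needs equal lengths")
    with open(path, "r+b") as f:
        data = f.read()  # illustration only: this read touches the whole file
        offset = data.find(old)
        if offset < 0:
            return False
        f.seek(offset)
        f.write(new)  # only these few bytes are written
    return True
```

Changing a quantity from 2 to 3 would work this way, but changing it from 9 to 10 would not, because every byte after the edit has to move.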

What to do

Unless you can do delta synchronization, I'd suggest implementing some caching on the client side first and, if that doesn't help enough, then splitting up your files.

Upvotes: 6
