Jesse Bunch

Reputation: 6679

CoreData Import and Upgrade Strategies

There are many things to love about CoreData, but I feel like data import is not one of them. I have some questions about what you use for import/upgrade strategies and I'd love to hear your input.

There are a few schools of thought on how to provide the initial database content. Some folks import data from a file on first run, and others provide a data store that is copied and used on first run, which is what I've personally done. If there are any other options, I'd love to hear them.

My problem with each of these methods is what to do when you push an upgrade. What do you do when you've added/changed/removed data to the initial dataset that you want existing users to see?

Do you store the in-use data model's version in NSUserDefaults and then execute some migration code on first run of a new version to insert/update the default data? I'm strictly talking data here, not schema. All of this seems so hacky as I can just see the wave of low ratings coming in because you didn't think of something when writing your upgrade code. Is it even a good thing to store default application data (that the user doesn't really modify) in CoreData?

So I guess my question is, what is your preferred import strategy and how do you normally go about upgrading that data when you release future versions?

Upvotes: 5

Views: 1165

Answers (3)

David Rönnqvist

Reputation: 56625

tl;dr Before committing to any solution, you should think about how you are expecting the update to behave. Any update in general can either be a full replace or a partial replace (diff first, then replace those parts). Both full and partial replaces have their pros and cons. The technical implementation of both kinds of solutions can vary a lot.


If I understand this correctly you have an initial set of data that comes with your application. You want to change the initial data set in a new release of your application. The data may or may not have been modified by the user.

The way I see it, a good solution to this problem depends on your application and the way the default data is currently being stored and used in your application.

User entities and initial entities (possibly modified) together

If the application ships with a default data set and the user can later modify those entries, remove them, or add their own, and it cannot be determined whether an entry is part of the initial data set or a user-modified entry, then updating the default data set is a lot trickier, but also more interesting.

Since you're asking "What IF the data does get modified?", I'm assuming that you also find this case the more interesting.

What is your expected behavior?

Personally, I would try to define the exact behavior that you are expecting before getting into the technical solution. Some cases are very simple, like these:

  • "if an entry in the default data set should be removed and the user has already removed it, then do nothing"

  • "if a new entry should be added to the default data set, then add it"

  • "if an entry is not a part of the default data set, then do nothing".

However, there are many subtle variations of these, like

  • "if an entry in the default data set should be removed and the user has modified it, then ...".

In this particular case you should probably treat the data as the user's data and not modify it, but maybe you have a good reason to update it anyway. When you start writing these down you will see quite clearly that there are many cases that you may not have thought of before. Also, by writing these cases down, you document your decisions so that you can go back and look at them later.
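One way to write these cases down is as a small decision table in code. The sketch below is only an illustration: all type and case names are made up for this example, and the two branches marked "policy choice" are exactly the decisions you have to make for your own app.

```swift
/// What the new release wants to do with one default entry (hypothetical names).
enum SeedChange { case add, remove, modify }

/// What has happened to that entry on the user's device.
enum LocalState { case untouched, modifiedByUser, deletedByUser, neverExisted }

/// What the update code should do in response.
enum Action { case insert, delete, overwrite, keepUserVersion, doNothing }

func decide(_ change: SeedChange, _ local: LocalState) -> Action {
    switch (change, local) {
    case (.add, .neverExisted):      return .insert
    case (.add, _):                  return .doNothing
    case (.remove, .untouched):      return .delete
    case (.remove, .modifiedByUser): return .keepUserVersion  // policy choice
    case (.remove, _):               return .doNothing        // e.g. already removed by the user
    case (.modify, .untouched):      return .overwrite
    case (.modify, .modifiedByUser): return .keepUserVersion  // policy choice
    case (.modify, _):               return .doNothing
    }
}
```

Because the switch has to be exhaustive, the compiler will complain about any combination you forgot to decide on.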

What are you planning for the future?

Once you have decided in detail what the goals are for the data update, you can go on to think about how to implement a solution for these goals. This is also a good time to start thinking about the future. If you feel the need to update your initial data set now, then chances are that you will want to update it again in the future. Maybe this is a good time to think about how you can make updates like these easier in the future. Maybe this is a good time to update the schema after all. But maybe not. Some solutions to the update problem don't require a schema update.

Designing for future data updates

If you have had the feeling of "if only XYZ" while thinking about how to update this data, then you probably have a place to start designing your future update mechanism. Without knowing more about the complexity of your data, its size, or the approximate ratio of inserts, deletes and unaltered entries in an update, it is very difficult to give concrete tips about how to design a good update solution. However, I will try to point out things to consider.

Going to a very high level of abstraction, there are two main ways to update a set of data: replace everything, or calculate the difference and replace only what has changed.


1. Replace everything

Design

If the amount of initial data is very small, i.e. small enough to not demand an advanced update mechanism, you could simply replace the entire set of default data on every update. To be able to replace the default data without altering the user's data, you will need to either separate the default data from the user's data (or have a solution where it is already separate) or at least be able to identify whether an entry is part of the initial data set or not.

Separating default data from user data

To be able to simply "replace all old default data with the new default data", all old default data must be identifiable and deletable. This can be done in a few different ways. If it is possible to heuristically identify whether an entry is part of the default data set, maybe through a creation timestamp or something like that, then no major modifications need to be made: all those entries can be identified as default data. If not, the first update will be more difficult.

As stated above, you should design for more future updates. Therefore, if you cannot identify what data is part of the default data set and what is the user's data, then you should probably modify your model so that there is some way to separate them. A simple boolean value is a very minor modification.

It is worth noting that big Core Data deletes can be very slow, since Core Data does a lot of work behind the scenes following relationships and applying the delete rule for each relationship. If you are deleting a whole set of data, it will most likely be faster to separate the default data into its own store, i.e. its own SQLite file on disk. Then the entire SQLite file can be deleted, since every entity in it is going to be deleted anyway. This will, however, increase the complexity of the solution, so measure the time a delete takes before making any performance decisions.
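To make the boolean-flag idea concrete, here is a minimal sketch. The entity name "Entry" and the attributes "name" and "isDefault" are assumptions for illustration (your model will differ), and the model is built in code with an in-memory store only so that the example is self-contained.

```swift
import CoreData

// Hypothetical model: one "Entry" entity with a name and an isDefault flag.
func makeModel() -> NSManagedObjectModel {
    let name = NSAttributeDescription()
    name.name = "name"
    name.attributeType = .stringAttributeType

    let isDefault = NSAttributeDescription()
    isDefault.name = "isDefault"
    isDefault.attributeType = .booleanAttributeType
    isDefault.defaultValue = false          // entries created by the user are not default data

    let entry = NSEntityDescription()
    entry.name = "Entry"
    entry.properties = [name, isDefault]

    let model = NSManagedObjectModel()
    model.entities = [entry]
    return model
}

// In-memory stack, used here only to keep the sketch runnable on its own.
func makeContext() throws -> NSManagedObjectContext {
    let coordinator = NSPersistentStoreCoordinator(managedObjectModel: makeModel())
    try coordinator.addPersistentStore(ofType: NSInMemoryStoreType,
                                       configurationName: nil, at: nil, options: nil)
    let context = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
    context.persistentStoreCoordinator = coordinator
    return context
}

/// Deletes every entry flagged as default data and returns how many were removed.
/// User entries (isDefault == NO) are left untouched.
func deleteDefaultEntries(in context: NSManagedObjectContext) throws -> Int {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Entry")
    request.predicate = NSPredicate(format: "isDefault == YES")
    let defaults = try context.fetch(request)
    for object in defaults {
        context.delete(object)
    }
    try context.save()
    return defaults.count
}
```

After this, the new default entries can be inserted with the flag set. If a modified default entry should count as the user's data, the code that edits it would also have to clear the flag.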

What about modified entries?

As mentioned above, there are a few different things that can be done with a modified entity during an update. If modified entities should be considered the user's entities, they have to be changed so that they appear as the user's entities to the update mechanism (i.e. the thing that deletes all default entities).

(A side note: Should a default entity that gets modified and then modified back to its original value be considered a user entity or a default entity? How can we keep track of such changes?)

The update procedure

Depending on whether or not the initial data is already stored separately and whether or not you choose to separate it, there may be a need for a migration the first time. There may also be a need for a migration if it cannot be determined what data is part of the default set. Once the initial data has been migrated (if needed), it can be updated separately, without migration, in future data updates.

Depending on the exact solution, it could be possible to do the updates in the background with a parent/child-context. This is described further in solution 2 (the "diff").

Pros:

  • Less code to write
  • Can handle any complexity of data by replacing it all

Cons:

  • May require migration to split the default data from the user's data
  • May cause entries in the default data set that have been deleted or modified to come back when updating data
  • Possibly a more complex data model
  • Will have poor performance on large sets of data

2. Replace only the difference

Design

Depending on how complex your data is and the ratio of altered vs un-altered entries in an update, one design that fits a small number of altered entries could be to store all updates separately. This, however, requires that all updates can be described this way. If you know the difference between the old default data set and the new default data set, then all updates can be described as either deletes, inserts or modifies.

(This resembles how a versioning system works: instead of copying the entire file, only the modifications (the "diff") are added. When updating, you don't keep the outdated data, you replace it. The benefits are similar, though: update time becomes proportional to the size of the update and not to the total size of the data.)
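As a rough sketch of what "describing an update as deletes, inserts and modifies" can look like, the function below compares two versions of a default data set. It assumes (my assumption, not something from your model) that every default entry has a stable string identifier; the names SeedDiff and diff(oldSeed:newSeed:) are made up for this example. You would typically run something like this at build time and ship only the result.

```swift
/// The difference between two versions of the default data set,
/// keyed by a stable identifier (hypothetical: a String id per entry).
struct SeedDiff<Value: Equatable> {
    var inserts: [String: Value] = [:]   // present only in the new set
    var deletes: Set<String> = []        // present only in the old set
    var modifies: [String: Value] = [:]  // present in both, but with a different value
}

func diff<Value: Equatable>(oldSeed: [String: Value],
                            newSeed: [String: Value]) -> SeedDiff<Value> {
    var result = SeedDiff<Value>()
    for (id, value) in newSeed {
        if let previous = oldSeed[id] {
            if previous != value { result.modifies[id] = value }
        } else {
            result.inserts[id] = value
        }
    }
    for id in oldSeed.keys where newSeed[id] == nil {
        result.deletes.insert(id)
    }
    return result
}
```

Unchanged entries appear in none of the three collections, which is why the cost of applying the update follows the size of the diff rather than the size of the data set.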

Inserts

Inserts are probably the simplest ones. By storing all new entries to be inserted separately, they can be iterated over and added to the user's data.

Deletes

Deletes are equally easy as long as entries can be uniquely identified and as long as it can be ensured that they haven't been modified in any way. By storing the information needed to uniquely identify an entity and to verify that it hasn't been modified, these entries can be fetched and deleted from Core Data.
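A minimal sketch of that check, using a plain dictionary as a stand-in for the Core Data store so the logic is easy to follow. The DeleteInstruction type and its fields are hypothetical; in a real app the "original value" could just as well be a hash or a last-modified timestamp.

```swift
/// One delete shipped with the update: which entry to remove, and what it
/// looked like in the previous default data set (hypothetical fields).
struct DeleteInstruction {
    let id: String
    let originalValue: String
}

/// Applies the deletes to `store` and returns the ids that were skipped
/// because the user had modified the entry.
func applyDeletes(_ instructions: [DeleteInstruction],
                  to store: inout [String: String]) -> [String] {
    var skipped: [String] = []
    for instruction in instructions {
        // Already removed by the user: nothing to do.
        guard let current = store[instruction.id] else { continue }
        if current == instruction.originalValue {
            store[instruction.id] = nil        // untouched default entry, safe to delete
        } else {
            skipped.append(instruction.id)     // modified by the user, leave it alone
        }
    }
    return skipped
}
```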

Modifies

Modified entries can be very tricky depending on the complexity of the changes. Single-value modifications are close to trivial, but relational modifications open up lots of new questions that should be looked at (like above) before going further.

How to store updates?

You may have noticed that I've been very vague about how these updates would be stored. That's because it also depends on the needs and resources at hand. A simple solution would be to include them in the updated application as pre-populated data somehow. However, the updates don't necessarily have to be stored on the device itself. If the total size of all the updates is small enough, they could be located on a server and downloaded to the device in the background. Storing updates on a server gives the huge bonus of being able to push new updates to the data without having to update the application itself.

Anyway, downloading updates or not, once the updates are on the device they should be stored somehow. They could be stored in another Core Data model within your application, in which case you wouldn't have to do a migration since the updates are entities in another model. Storing the updates in flat files, or any other non-Core Data way also has this advantage.

Deciding how to store the updates is similar to deciding how to store any kind of data in the first place. It should be similar to the process you used to decide to use Core Data for your main data.

The update procedure

When the user launches your updated application, you wouldn't necessarily have to lock the UI for a long migration, since the model itself doesn't necessarily have to change. Assuming that the updates have somehow gotten onto the device and are stored somewhere, they can be iterated over in the background. If you are only targeting iOS 5 or later, you can use a parent/child-context setup to make updates to Core Data in the background. A good resource for how to do background imports in Core Data is the Core Data for Mac, iPhone & iPad Update from iDeveloper.tv. There are of course also the "What's new in Core Data" WWDC videos that cover parent/child-context setups.

If you go with such a solution, you could create a background context and make all modifications in there on a low priority queue. Depending on the amount of data being updated, I would save off any modifications to the "real" Core Data context in batches and also remove the entities in the update tables that have already been processed. This way, the entire update process would be able to resume where it was if the update took a very long time and the user quit or if the application crashed in the middle of it.

Generally, no matter how you insert or delete large amounts of data, it is good to save in batches and to indicate in some way what data has already been processed so that the application can resume the import/delete. It doesn't have to save after every entry. If the application crashes and a few entries weren't saved, it is still a huge win if it can resume just before those entries and process them again. By marking data as processed only after it has been saved, the import knows where to resume without missing any data.

If using lists of data to be inserted/deleted/modified: by removing entities from these lists after the changes they represent have been saved in Core Data, the update mechanism can keep track of the inserts/updates/deletes that have not yet been processed.

Once all updates have been saved to the "real" context, you would be left with just an empty list of updates.
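The batching and resuming logic can be sketched independently of Core Data. In the example below, UpdateQueue is a hypothetical stand-in for the stored list of pending updates, `apply` stands for "make the change in the background context" and `save` for "save the context"; none of these names come from an existing API.

```swift
/// Pending updates that have not yet been saved to the "real" store.
/// In a real app this list would itself be persisted (e.g. in the update tables).
struct UpdateQueue {
    var pending: [String]

    /// Applies the pending updates in batches. An update is removed from
    /// `pending` only after the batch containing it has been saved, so an
    /// interrupted run resumes at the first unsaved batch.
    mutating func process(batchSize: Int,
                          apply: (String) -> Void,
                          save: () throws -> Void) throws {
        precondition(batchSize > 0, "batchSize must be positive")
        while !pending.isEmpty {
            let batch = Array(pending.prefix(batchSize))
            batch.forEach(apply)
            try save()                        // persist the batch first...
            pending.removeFirst(batch.count)  // ...then mark it as processed
        }
    }
}
```

Note that a batch that was applied but not saved gets applied again on the next run, so the individual updates should be safe to repeat (for example, "insert if no entry with this id exists" rather than a blind insert).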

NOTE: In a parent/child-context you will have to save the "master" context at one point or another because it is the only one that actually persists data to disk. The other saves are only in-memory.
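A minimal sketch of that two-step save. The helper name saveThroughParent is made up, and it assumes a single level of nesting where the parent context is attached directly to the persistent store coordinator and uses a queue-based concurrency type.

```swift
import CoreData

/// Saves a child context and then its parent.
/// Saving the child only pushes the changes up into the parent (in memory);
/// saving the parent is what hands them to the persistent store.
func saveThroughParent(child: NSManagedObjectContext) throws {
    var failure: Error?

    child.performAndWait {
        do {
            if child.hasChanges { try child.save() }    // child -> parent, in memory only
        } catch {
            failure = error
        }
    }
    if let failure = failure { throw failure }

    guard let parent = child.parent else { return }
    parent.performAndWait {
        do {
            if parent.hasChanges { try parent.save() }  // parent -> persistent store
        } catch {
            failure = error
        }
    }
    if let failure = failure { throw failure }
}
```

This version blocks the caller until both saves are done, which keeps the example short. In an app you would more likely use the asynchronous perform on both contexts so the calling thread is not held up.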

Pros:

  • More performant for small updates or large amounts of un-altered data
  • Possibly a small size of data to transfer/store for the update
  • (If downloading updates) Default data can be updated without having to update the application

Cons:

  • Plenty of code to write
  • Will require a migration of the "update" model when the "normal" model is migrated to keep the two models in sync.

I noticed that this answer got way longer than I first intended. I realize that I tried to remain very general in my solutions and that this may not be precise enough for your situation. If you'd like, you can comment on my answer and add more details about your problem and the constraints you have. This way I may be able to better fit a solution to your needs.

Upvotes: 6

Jeff Wolski

Reputation: 6372

You can copy the sqlite file that was created by Core Data to your machine, modify the data in the file, and put it on a server. The app can then download the sqlite file with the updated data and use it to replace the sqlite file in the app's Documents folder.

Then you have to reset the persistent store and Managed Object Context. I do this in a relational database with around 7000 records, and it works like a charm.

Upvotes: 0

hypercrypt

Reputation: 15376

If it is data that does not get modified, you can just create a store and ship it in the app bundle. You can then open it as read-only directly from the NSBundle. That way you can swap out the whole store with an update and have no problems with migration etc.
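For illustration, opening such a store could look roughly like this. The helper name addReadOnlyStore and the resource name "Defaults.sqlite" in the usage note are made up; the read-only behaviour comes from passing NSReadOnlyPersistentStoreOption when adding the store.

```swift
import CoreData

/// Attaches the SQLite store at `url` to `coordinator` as read-only.
/// Saves that try to write to this store will fail, which is the point:
/// the bundled default data is never modified in place.
func addReadOnlyStore(at url: URL,
                      to coordinator: NSPersistentStoreCoordinator) throws -> NSPersistentStore {
    return try coordinator.addPersistentStore(
        ofType: NSSQLiteStoreType,
        configurationName: nil,
        at: url,
        options: [NSReadOnlyPersistentStoreOption: true]
    )
}
```

In the app, the URL would come from something like Bundle.main.url(forResource: "Defaults", withExtension: "sqlite"). One thing worth checking when you prepare the file: SQLite stores written with write-ahead logging keep recent changes in a separate -wal file, so make sure the file you bundle actually contains all of the data.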

Upvotes: 1
