Reputation: 2660
I have a huge JSON file titled something.json. The file is 20 MB. I'm reading this in with GSON. It's being read on a standard Android Nexus 5X.
Example of the Json:
[
{"country":"UA","name":"Hurzuf","_id":707860,"coord":{"lon":34.283333,"lat":44.549999}},
{"country":"UA","name":"Il’ichëvka","_id":707716,"coord":{"lon":34.383331,"lat":44.666668}},
{"country":"BG","name":"Rastnik","_id":727762,"coord":{"lon":25.283331,"lat":41.400002}}
...
]
Code:
@Override
protected ArrayList<City> doInBackground(File... files) {
ArrayList<City> cities = new ArrayList<>();
try {
InputStream is = new FileInputStream(files[0]);
JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
reader.beginArray();
while (reader.hasNext()) {
City city = new Gson().fromJson(reader, City.class);
cities.add(city);
}
reader.endArray();
reader.close();
} catch (Exception e) {
mResult.onFinish(cities, e.getMessage());
}
Collections.sort(cities, (o1, o2) -> o1.getName().compareTo(o2.getName()));
mResult.onFinish(cities, CityService.SUCCESS);
return cities;
}
Library used:
com.google.code.gson:gson:2.8.0
It needs to work from Android API 16 till the latest.
I need to read this into mCities and sort it alphabetically by city name. Right now this takes 3 minutes, and it has to be done in around 10 seconds. My approach would be to cut up the JSON file into 10 smaller chunks, read those in, concatenate them, and sort.
So my question is: how do I divide the file into smaller chunks, and is this the correct approach to solving the problem?
Link to the file: http://www.jimclermonts.nl/docs/cities.json
Upvotes: 3
Views: 2497
Reputation: 21115
I hardly ever do Android coding per se, but since this is pure Java, I have some notes and probably a few ideas for you to go with.
Your reader does a lot of excessive work while reading each element.
First of all, you don't need to create a Gson instance every time you need one: each new Gson instance also hits the heap, taking more time to execute and then garbage-collect. Next, there is a difference between plain deserialization and JSON stream reading in Gson: the former may use a heavy type-adapter composition under the hood, whilst the latter can simply parse the JSON document token by token. Having said that, you can gain better performance by reading the JSON stream directly: your JSON file is known to have a very strict structure, so a high-level parser can be implemented much more simply.
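For illustration, a minimal sketch of the first point, adapting the loop from your question (the GSON field name is just illustrative):
private static final Gson GSON = new Gson(); // created once, reused for every element
...
reader.beginArray();
while (reader.hasNext()) {
    // no per-element Gson allocation here
    City city = GSON.fromJson(reader, City.class);
    cities.add(city);
}
reader.endArray();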
Suppose a simple test suite with different implementations for your problem:
final class City {
@SerializedName("_id")
final int id;
@SerializedName("country")
final String country;
@SerializedName("name")
final String name;
@SerializedName("coord")
final Coordinates coordinates;
private City(final int id, final String country, final String name, final Coordinates coordinates) {
this.id = id;
this.country = country;
this.name = name;
this.coordinates = coordinates;
}
static City of(final int id, final String country, final String name, final Coordinates coordinates) {
return new City(id, country, name, coordinates);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final City that = (City) o;
return id == that.id;
}
@Override
public int hashCode() {
return id;
}
@SuppressWarnings("ConstantConditions")
public static int compareByName(final City city1, final City city2) {
return city1.name.compareTo(city2.name);
}
}
final class Coordinates {
@SerializedName("lat")
final double latitude;
@SerializedName("lon")
final double longitude;
private Coordinates(final double latitude, final double longitude) {
this.latitude = latitude;
this.longitude = longitude;
}
static Coordinates of(final double latitude, final double longitude) {
return new Coordinates(latitude, longitude);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final Coordinates that = (Coordinates) o;
return Double.compare(that.latitude, latitude) == 0
&& Double.compare(that.longitude, longitude) == 0;
}
@Override
public int hashCode() {
final long latitudeBits = Double.doubleToLongBits(latitude);
final long longitudeBits = Double.doubleToLongBits(longitude);
final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
return 31 * latitudeHash + longitudeHash;
}
}
interface ITest {
@Nonnull
default String getName() {
return getClass().getSimpleName();
}
@Nonnull
Collection<City> test(@Nonnull JsonReader jsonReader)
throws IOException;
}
public static void main(final String... args)
throws IOException {
final Iterable<ITest> tests = ImmutableList.of(
FirstTest.get(),
ReadAsWholeListTest.get(),
ReadAsWholeTreeSetTest.get(),
ReadJsonStreamIntoListTest.get(),
ReadJsonStreamIntoTreeSetTest.get(),
ReadJsonStreamIntoListChunksTest.get()
);
for ( int i = 0; i < 3; i++ ) {
for ( final ITest test : tests ) {
try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
if ( zipEntry.getName().equals("cities.json") ) {
final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
System.out.printf("%1$35s : ", test.getName());
final Stopwatch stopwatch = Stopwatch.createStarted();
final Collection<City> cities = test.test(jsonReader);
System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
assertSorted(cities, City::compareByName);
}
}
}
}
System.out.println("--------------------");
}
}
private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
final Iterator<? extends E> iterator = iterable.iterator();
if ( !iterator.hasNext() ) {
return;
}
E a = iterator.next();
if ( !iterator.hasNext() ) {
return;
}
do {
final E b = iterator.next();
if ( comparator.compare(a, b) > 0 ) {
throw new AssertionError(a + " " + b);
}
a = b;
} while ( iterator.hasNext() );
}
This is the slowest one, and it's just an adaptation of the code from your question to the test harness.
final class FirstTest
implements ITest {
private static final ITest instance = new FirstTest();
private FirstTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final List<City> cities = new ArrayList<>();
while ( jsonReader.hasNext() ) {
final City city = new Gson().fromJson(jsonReader, City.class);
cities.add(city);
}
jsonReader.endArray();
cities.sort(City::compareByName);
return cities;
}
}
This is most likely how you might implement it. It's not the winner, but it's the simplest one, and it uses default sorting.
final class ReadAsWholeListTest
implements ITest {
private static final ITest instance = new ReadAsWholeListTest();
private ReadAsWholeListTest() {
}
static ITest get() {
return instance;
}
private static final Gson gson = new Gson();
private static final Type citiesListType = new TypeToken<List<City>>() {
}.getType();
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader) {
final List<City> cities = gson.fromJson(jsonReader, citiesListType);
cities.sort(City::compareByName);
return cities;
}
}
Another idea, if you're not bound to lists, is to use an already-sorted collection like TreeSet. Since I don't know of a way to tell Gson which comparator a newly created TreeSet should use, this test registers a custom type adapter factory (not required if City were already comparable by name, but that approach is less flexible).
final class ReadAsWholeTreeSetTest
implements ITest {
private static final ITest instance = new ReadAsWholeTreeSetTest();
private ReadAsWholeTreeSetTest() {
}
static ITest get() {
return instance;
}
@SuppressWarnings({ "rawtypes", "unchecked" })
private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);
private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
City.class, (Comparator<City>) City::compareByName
);
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
return null;
}
final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
@SuppressWarnings({ "rawtypes", "unchecked" })
final Comparator<Object> comparator = (Comparator) comparatorsRegistry.get(elementType);
if ( comparator == null ) {
return null;
}
final TypeAdapter<TreeSet<?>> originalTreeSetTypeAdapter = gson.getDelegateAdapter(this, rawTreeSetType);
final TypeAdapter<?> originalElementTypeAdapter = gson.getDelegateAdapter(this, TypeToken.get(elementType));
final TypeAdapter<TreeSet<Object>> treeSetTypeAdapter = new TypeAdapter<TreeSet<Object>>() {
@Override
public void write(final JsonWriter jsonWriter, final TreeSet<Object> treeSet)
throws IOException {
originalTreeSetTypeAdapter.write(jsonWriter, treeSet);
}
@Override
public TreeSet<Object> read(final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final TreeSet<Object> elements = new TreeSet<>(comparator);
while ( jsonReader.hasNext() ) {
final Object element = originalElementTypeAdapter.read(jsonReader);
elements.add(element);
}
return elements;
}
}.nullSafe();
@SuppressWarnings({ "rawtypes", "unchecked" })
final TypeAdapter<T> castTreeSetTypeAdapter = (TypeAdapter<T>) treeSetTypeAdapter;
return castTreeSetTypeAdapter;
}
})
.create();
private static final Type citiesSetType = new TypeToken<TreeSet<City>>() {
}.getType();
@Nonnull
@Override
public Set<City> test(@Nonnull final JsonReader jsonReader) {
return gson.fromJson(jsonReader, citiesSetType);
}
}
The following class is a special reader test that uses a simplified strategy for reading the cities JSON.
It's probably as simple as possible (in terms of JSON structure analysis), and it requires the JSON document to be very strict.
abstract class AbstractJsonStreamTest
implements ITest {
protected static void read(final JsonReader jsonReader, final Consumer<? super City> cityConsumer)
throws IOException {
jsonReader.beginArray();
while ( jsonReader.hasNext() ) {
jsonReader.beginObject();
require(jsonReader, "country");
final String country = jsonReader.nextString();
require(jsonReader, "name");
final String name = jsonReader.nextString();
require(jsonReader, "_id");
final int id = jsonReader.nextInt();
require(jsonReader, "coord");
jsonReader.beginObject();
require(jsonReader, "lon");
final double longitude = jsonReader.nextDouble();
require(jsonReader, "lat");
final double latitude = jsonReader.nextDouble();
jsonReader.endObject();
jsonReader.endObject();
final City city = City.of(id, country, name, Coordinates.of(latitude, longitude));
cityConsumer.accept(city);
}
jsonReader.endArray();
}
private static void require(final JsonReader jsonReader, final String expectedName)
throws IOException {
final String actualName = jsonReader.nextName();
if ( !actualName.equals(expectedName) ) {
throw new JsonParseException("Expected " + expectedName + " but was " + actualName);
}
}
}
This one is pretty much like ReadAsWholeListTest
but it uses a simplified deserialization mechanism.
final class ReadJsonStreamIntoListTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListTest();
private ReadJsonStreamIntoListTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final List<City> cities = new ArrayList<>();
read(jsonReader, cities::add);
cities.sort(City::compareByName);
return cities;
}
}
This one, like the previous one, is a simpler counterpart of a more expensive implementation (ReadAsWholeTreeSetTest); however, it does not require a custom type adapter.
final class ReadJsonStreamIntoTreeSetTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoTreeSetTest();
private ReadJsonStreamIntoTreeSetTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<City> cities = new TreeSet<>(City::compareByName);
read(jsonReader, cities::add);
return cities;
}
}
The following test is based on your initial idea, but it does not sort the chunks in parallel (I'm not sure it would help, but you could give it a try; a rough sketch of a parallel variant follows the class below). I still think the previous two are simpler, probably easier to maintain, and give a bigger performance gain.
final class ReadJsonStreamIntoListChunksTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListChunksTest();
private ReadJsonStreamIntoListChunksTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<List<City>> cityChunks = new ArrayList<>();
final AtomicReference<List<City>> cityChunkRef = new AtomicReference<>(new ArrayList<>());
read(jsonReader, city -> {
final List<City> cityChunk = cityChunkRef.get();
cityChunk.add(city);
if ( cityChunk.size() >= 10000 ) {
cityChunks.add(cityChunk);
cityChunkRef.set(new ArrayList<>());
}
});
if ( !cityChunkRef.get().isEmpty() ) {
cityChunks.add(cityChunkRef.get());
}
for ( final List<City> cities : cityChunks ) {
Collections.sort(cities, City::compareByName);
}
return merge(cityChunks, City::compareByName);
}
/**
* <p>Adapted from:</p>
* <ul>
* <li>Original question: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list</li>
* <li>Accepted answer: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list/1775748#1775748</li>
* </ul>
*/
@SuppressWarnings("MethodCallInLoopCondition")
private static <E> List<E> merge(final Iterable<? extends List<E>> lists, final Comparator<? super E> comparator) {
int totalSize = 0;
for ( final List<E> l : lists ) {
totalSize += l.size();
}
final List<E> result = new ArrayList<>(totalSize);
while ( result.size() < totalSize ) { // while we still have something to add
List<E> lowest = null;
for ( final List<E> l : lists ) {
if ( !l.isEmpty() ) {
if ( lowest == null || comparator.compare(l.get(0), lowest.get(0)) <= 0 ) {
lowest = l;
}
}
}
assert lowest != null;
result.add(lowest.get(0));
lowest.remove(0);
}
return result;
}
}
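If you do want to try sorting the chunks in parallel, a rough sketch could look like the following. This is my own addition and was not part of the measured tests; it relies on java.util.concurrent (available on API 16), and the sortChunksInParallel name is just illustrative:
private static void sortChunksInParallel(final Iterable<? extends List<City>> cityChunks)
        throws InterruptedException, ExecutionException {
    // one task per chunk on a pool sized to the number of cores
    final ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
        final Collection<Future<?>> futures = new ArrayList<>();
        for ( final List<City> cityChunk : cityChunks ) {
            futures.add(executor.submit(() -> Collections.sort(cityChunk, City::compareByName)));
        }
        for ( final Future<?> future : futures ) {
            future.get(); // wait for the sort and propagate any failure
        }
    } finally {
        executor.shutdown();
    }
}
The sorted chunks can then be merged with the same merge(...) method as above.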
For my desktop JRE I could obtain the following test results:
FirstTest : in 5797 ms with 209557 elements
ReadAsWholeListTest : in 796 ms with 209557 elements
ReadAsWholeTreeSetTest : in 733 ms with 162006 elements
ReadJsonStreamIntoListTest : in 461 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 452 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 607 ms with 209557 elements
--------------------
FirstTest : in 3396 ms with 209557 elements
ReadAsWholeListTest : in 493 ms with 209557 elements
ReadAsWholeTreeSetTest : in 520 ms with 162006 elements
ReadJsonStreamIntoListTest : in 385 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 377 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 540 ms with 209557 elements
--------------------
FirstTest : in 3448 ms with 209557 elements
ReadAsWholeListTest : in 429 ms with 209557 elements
ReadAsWholeTreeSetTest : in 421 ms with 162006 elements
ReadJsonStreamIntoListTest : in 400 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 385 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 480 ms with 209557 elements
--------------------
As you can see, creating a new Gson instance per element is definitely a bad idea, and the more optimized tests gain noticeably better performance.
However, splitting the large list into sorted chunks (without parallelism) to be merged later does not give much of a performance gain in my environment.
For simplicity, and probably as the best choice, I would go with one of the two ReadJsonStreamInto*Test implementations, depending on the collection you need.
I'm not really sure how well this would work in a real Android environment, but the point is that you can do the JSON deserialization a bit better than Gson's defaults by reading the stream through its internals directly.
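As a final sketch (my own adaptation, not tested on a device), this is roughly how the stream-reading approach could be wired back into your doInBackground. It assumes a read(...) helper like the one in AbstractJsonStreamTest above, rewritten against a tiny CityCallback interface because java.util.function.Consumer is only available from API 24:
// hypothetical callback replacing java.util.function.Consumer on older API levels
interface CityCallback {
    void accept(City city);
}

@Override
protected ArrayList<City> doInBackground(File... files) {
    final ArrayList<City> cities = new ArrayList<>();
    JsonReader reader = null;
    try {
        reader = new JsonReader(new InputStreamReader(
                new BufferedInputStream(new FileInputStream(files[0])), "UTF-8"));
        // read(...) is the token-by-token parser shown above, adapted to take a CityCallback
        read(reader, new CityCallback() {
            @Override
            public void accept(final City city) {
                cities.add(city);
            }
        });
        Collections.sort(cities, City::compareByName);
    } catch (IOException e) {
        mResult.onFinish(cities, e.getMessage());
        return cities;
    } finally {
        if ( reader != null ) {
            try {
                reader.close();
            } catch (IOException ignored) {
            }
        }
    }
    mResult.onFinish(cities, CityService.SUCCESS);
    return cities;
}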
By the way:
- The TreeSet-based tests report fewer elements (162006 vs 209557), which suggests that cities.json contains duplicate entries (assuming _id is the identity).
- Could you sort cities.json in advance on your workstation before you use it on an Android device? Additionally, you might want to filter out the duplicates if my above assumption is correct.
Upvotes: 4