How to cut up large JSON file into chunks and sort using GSON

Question

I have a large JSON file named something.json, about 20 MB in size. I'm reading it with Gson on a stock Android Nexus 5X.

Example of the JSON:

[
    {"country":"UA","name":"Hurzuf","_id":707860,"coord":{"lon":34.283333,"lat":44.549999}},
    {"country":"UA","name":"Il’ichëvka","_id":707716,"coord":{"lon":34.383331,"lat":44.666668}},
    {"country":"BG","name":"Rastnik","_id":727762,"coord":{"lon":25.283331,"lat":41.400002}}
...
]

Code:

@Override
protected ArrayList<City> doInBackground(File... files) {
    ArrayList<City> cities = new ArrayList<>();
    try {
        InputStream is = new FileInputStream(files[0]);
        JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
        reader.beginArray();
        while (reader.hasNext()) {
            City city = new Gson().fromJson(reader, City.class);
            cities.add(city);
        }
        reader.endArray();
        reader.close();
    } catch (Exception e) {
        mResult.onFinish(cities, e.getMessage());
    }

    Collections.sort(cities, (o1, o2) -> o1.getName().compareTo(o2.getName()));
    mResult.onFinish(cities, CityService.SUCCESS);
    return cities;
}

Library used:

com.google.code.gson:gson:2.8.0

It needs to work from Android API 16 up to the latest version.

I need to read this into mCities and sort it alphabetically by city name. Right now this takes 3 minutes, and it has to be done in around 10 seconds. My idea is to split the JSON file into 10 smaller chunks, read these in, concatenate them, and sort the result.

So my question is: how do I divide the file into smaller chunks, and is this the correct approach to solve this problem?

Link to the file: http://www.jimclermonts.nl/docs/cities.json

Lyubomyr Shaydariv · Accepted Answer

I almost never do Android development, but since this is pure Java I have some notes and ideas for you. Your reader does a lot of unnecessary work for each element. First of all, you don't need to create a new Gson instance every time you need one:

It's immutable and thread-safe.
It's relatively expensive to create.
Creating a Gson instance also allocates on the heap, which costs time both to execute and later to garbage-collect.

Next, there is a difference between full deserialization and JSON stream reading in Gson: the former may use a heavy composition of type adapters under the hood, whereas the latter simply parses the JSON document token by token. That said, you can get better performance by reading the JSON stream directly: your JSON file has a strict, known structure, so a hand-written high-level parser can be much simpler.
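To make the token-by-token idea concrete, here is a minimal sketch, assuming Gson (the 2.8.0 artifact from the question) is on the classpath. The class `CityNameStreamSketch` and its method `readNames` are hypothetical names used only for illustration, and the sketch extracts just the `name` field rather than building full `City` objects.

```java
import com.google.gson.stream.JsonReader;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not one of the test classes in this answer.
final class CityNameStreamSketch {

    // Reads only the "name" of every object in a top-level JSON array.
    static List<String> readNames(final Reader in) throws IOException {
        final List<String> names = new ArrayList<>();
        try ( final JsonReader reader = new JsonReader(in) ) {
            reader.beginArray();
            while ( reader.hasNext() ) {
                reader.beginObject();
                while ( reader.hasNext() ) {
                    if ( reader.nextName().equals("name") ) {
                        names.add(reader.nextString());
                    } else {
                        reader.skipValue(); // also skips nested objects such as "coord"
                    }
                }
                reader.endObject();
            }
            reader.endArray();
        }
        return names;
    }

    public static void main(final String... args) throws IOException {
        final String json = "[{\"country\":\"UA\",\"name\":\"Hurzuf\",\"_id\":707860,"
                + "\"coord\":{\"lon\":34.283333,\"lat\":44.549999}}]";
        System.out.println(readNames(new StringReader(json))); // prints [Hurzuf]
    }
}
```

No `Gson` instance and no type adapters are involved here; the reader only walks the tokens it is asked for, which is why this style can be cheaper when the structure is known in advance.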

Consider a simple test suite with several implementations for your problem:

Data objects

City.java

final class City {

    @SerializedName("_id")
    final int id;

    @SerializedName("country")
    final String country;

    @SerializedName("name")
    final String name;

    @SerializedName("coord")
    final Coordinates coordinates;

    private City(final int id, final String country, final String name, final Coordinates coordinates) {
        this.id = id;
        this.country = country;
        this.name = name;
        this.coordinates = coordinates;
    }

    static City of(final int id, final String country, final String name, final Coordinates coordinates) {
        return new City(id, country, name, coordinates);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final City that = (City) o;
        return id == that.id;
    }

    @Override
    public int hashCode() {
        return id;
    }

    @SuppressWarnings("ConstantConditions")
    public static int compareByName(final City city1, final City city2) {
        return city1.name.compareTo(city2.name);
    }

}

Coordinates.java

final class Coordinates {

    @SerializedName("lat")
    final double latitude;

    @SerializedName("lon")
    final double longitude;

    private Coordinates(final double latitude, final double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    static Coordinates of(final double latitude, final double longitude) {
        return new Coordinates(latitude, longitude);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final Coordinates that = (Coordinates) o;
        return Double.compare(that.latitude, latitude) == 0
                && Double.compare(that.longitude, longitude) == 0;
    }

    @Override
    public int hashCode() {
        final long latitudeBits = Double.doubleToLongBits(latitude);
        final long longitudeBits = Double.doubleToLongBits(longitude);
        final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
        final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
        return 31 * latitudeHash + longitudeHash;
    }

}

Test infrastructure

ITest.java

interface ITest {

    @Nonnull
    default String getName() {
        return getClass().getSimpleName();
    }

    @Nonnull
    Collection<City> test(@Nonnull JsonReader jsonReader)
            throws IOException;

}

main

    public static void main(final String... args)
            throws IOException {
        final Iterable<ITest> tests = ImmutableList.of(
                FirstTest.get(),
                ReadAsWholeListTest.get(),
                ReadAsWholeTreeSetTest.get(),
                ReadJsonStreamIntoListTest.get(),
                ReadJsonStreamIntoTreeSetTest.get(),
                ReadJsonStreamIntoListChunksTest.get()
        );
        for ( int i = 0; i < 3; i++ ) {
            for ( final ITest test : tests ) {
                try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
                    for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
                        if ( zipEntry.getName().equals("cities.json") ) {
                            final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
                            System.out.printf("%1$35s : ", test.getName());
                            final Stopwatch stopwatch = Stopwatch.createStarted();
                            final Collection<City> cities = test.test(jsonReader);
                            System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
                            assertSorted(cities, City::compareByName);
                        }
                    }
                }
            }
            System.out.println("--------------------");
        }
    }

    private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
        final Iterator<? extends E> iterator = iterable.iterator();
        if ( !iterator.hasNext() ) {
            return;
        }
        E a = iterator.next();
        if ( !iterator.hasNext() ) {
            return;
        }
        do {
            final E b = iterator.next();
            if ( comparator.compare(a, b) > 0 ) {
                throw new AssertionError(a + " " + b);
            }
            a = b;
        } while ( iterator.hasNext() );
    }

Tests

FirstTest.java

This is the slowest one: it's just the code from your question adapted to the test suite.

final class FirstTest
        implements ITest {

    private static final ITest instance = new FirstTest();

    private FirstTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        jsonReader.beginArray();
        final List<City> cities = new ArrayList<>();
        while ( jsonReader.hasNext() ) {
            final City city = new Gson().fromJson(jsonReader, City.class);
            cities.add(city);
        }
        jsonReader.endArray();
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadAsWholeListTest.java

This is most likely how you might implement it. It's not the winner, but it's the simplest one, and it uses default sorting.

final class ReadAsWholeListTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeListTest();

    private ReadAsWholeListTest() {
    }

    static ITest get() {
        return instance;
    }

    private static final Gson gson = new Gson();

    private static final Type citiesListType = new TypeToken<List<City>>() {
    }.getType();

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader) {
        final List<City> cities = gson.fromJson(jsonReader, citiesListType);
        cities.sort(City::compareByName);
        return cities;
    }

}
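One caveat regarding the API 16 requirement in the question: `List.sort`, used in the class above, is a Java 8 default method, and as far as I know it only became available on Android at API level 24. A minimal sketch of the same sort done with `Collections.sort` and an anonymous `Comparator`, using only long-standing JDK calls; the class `NameSortSketch` and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the original test suite.
final class NameSortSketch {

    // Sorts a copy of the given names with Collections.sort instead of List.sort.
    static List<String> sortedCopy(final List<String> names) {
        final List<String> copy = new ArrayList<>(names);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(final String a, final String b) {
                return a.compareTo(b);
            }
        });
        return copy;
    }

    public static void main(final String... args) {
        System.out.println(sortedCopy(Arrays.asList("Rastnik", "Hurzuf"))); // prints [Hurzuf, Rastnik]
    }
}
```

The sorting algorithm is the same either way, so this should not change the timings; it only avoids a method that may be missing on older devices.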

ReadAsWholeTreeSetTest.java

Another idea, if you're not bound to lists, is to use an already-sorted collection such as TreeSet. Since I don't know of a way to tell Gson which comparator a new TreeSet should use, this test needs a custom type adapter factory (this is not required if City is already Comparable by name, but that is less flexible).
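The already-sorted-collection idea can be tried with the JDK alone, before any Gson wiring. A minimal sketch, assuming a simplified, hypothetical `NamedCity` class (not the `City` class of this answer) and made-up sample data: a `TreeSet` treats elements whose comparator returns 0 as duplicates, so ordering by name alone silently drops cities that share a name, while adding the id as a tie-breaker keeps them.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for City; only the fields the sketch needs.
final class NamedCity {
    final int id;
    final String name;

    NamedCity(final int id, final String name) {
        this.id = id;
        this.name = name;
    }
}

final class TreeSetSketch {

    // Orders by name only: cities with equal names compare as 0.
    static final Comparator<NamedCity> BY_NAME = new Comparator<NamedCity>() {
        @Override
        public int compare(final NamedCity a, final NamedCity b) {
            return a.name.compareTo(b.name);
        }
    };

    // Orders by name, then by id, so distinct cities never compare as 0.
    static final Comparator<NamedCity> BY_NAME_THEN_ID = new Comparator<NamedCity>() {
        @Override
        public int compare(final NamedCity a, final NamedCity b) {
            final int byName = a.name.compareTo(b.name);
            if ( byName != 0 ) {
                return byName;
            }
            return a.id < b.id ? -1 : (a.id == b.id ? 0 : 1);
        }
    };

    // Returns how many of the given cities survive insertion into a TreeSet.
    static int sizeAfterInsert(final List<NamedCity> cities, final Comparator<NamedCity> comparator) {
        final TreeSet<NamedCity> sorted = new TreeSet<>(comparator);
        sorted.addAll(cities);
        return sorted.size();
    }

    public static void main(final String... args) {
        final List<NamedCity> cities = Arrays.asList(
                new NamedCity(1, "Springfield"),
                new NamedCity(2, "Hurzuf"),
                new NamedCity(3, "Springfield"));
        System.out.println(sizeAfterInsert(cities, BY_NAME));         // prints 2
        System.out.println(sizeAfterInsert(cities, BY_NAME_THEN_ID)); // prints 3
    }
}
```

If the real file contains repeated city names, it is worth comparing the element count of a name-ordered `TreeSet` against the list-based variants before relying on it.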

final class ReadAsWholeTreeSetTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeTreeSetTest();

    private ReadAsWholeTreeSetTest() {
    }

    static ITest get() {
        return instance;
    }

    @SuppressWarnings({ "rawtypes", "unchecked" })
    private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);

    private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
            City.class, (Comparator<City>) City::compareByName
    );

    private static final Gson gson = new GsonBuilder()
            .registerTypeAdapterFactory(new TypeAdapterFactory() {
                @Override
                public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
                    if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
                        return null;
                    }
                    final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
                    @SuppressWarnings({ "rawtypes", "unchecked" })
                    final Comparator
