Jim Clermonts
Jim Clermonts

Reputation: 2660

How to cut up large JSON file into chunks and sort using GSON

I have a huge JSON file titled something.json. The file is 20 MB. I'm reading this in with GSON. It's being read on a standard Android Nexus 5X.

Example of the Json:

[
    {"country":"UA","name":"Hurzuf","_id":707860,"coord":{"lon":34.283333,"lat":44.549999}},
    {"country":"UA","name":"Il’ichëvka","_id":707716,"coord":{"lon":34.383331,"lat":44.666668}},
    {"country":"BG","name":"Rastnik","_id":727762,"coord":{"lon":25.283331,"lat":41.400002}}
...
]

Code:

@Override
protected ArrayList<City> doInBackground(File... files) {
    ArrayList<City> cities = new ArrayList<>();
    try {
        InputStream is = new FileInputStream(files[0]);
        JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
        reader.beginArray();
        while (reader.hasNext()) {
            City city = new Gson().fromJson(reader, City.class);
            cities.add(city);
        }
        reader.endArray();
        reader.close();
    } catch (Exception e) {
        mResult.onFinish(cities, e.getMessage());
    }

    Collections.sort(cities, (o1, o2) -> o1.getName().compareTo(o2.getName()));
    mResult.onFinish(cities, CityService.SUCCESS);
    return cities;
}

Library used:

com.google.code.gson:gson:2.8.0

It needs to work from Android API 16 till the latest.

I need to read this in to mCities, and sort it alphabetically on city name. Right now this takes 3 minutes and it has to be done in around 10 seconds. My approach is to cut up the json file in 10 smaller chunks, read these in, concatenate and sort them.

So my question is: how to divide the file in smaller chunks and is this the correct approach to solve this problem?

Link to the file: http://www.jimclermonts.nl/docs/cities.json

Upvotes: 3

Views: 2497

Answers (1)

Lyubomyr Shaydariv
Lyubomyr Shaydariv

Reputation: 21115

I mostly never do Android coding per se, but I have some notes and probably ideas for you to go with since this is pure Java. Your reader does very excessive work while reading each element. First of all, you don't need to create Gson every time you need it:

  • It's immutable and thread-safe.
  • It's relatively expensive to create.
  • Instantiating a Gson instance also hits the heap taking more time to execute and then garbage-collect.

Next, there is a difference between only-deserialization and JSON stream reading in Gson: the first may use a heavy type adapters composition under the hood, whilst the latter simply can parse JSON documents token by token. Having that said, you can gain a better performance while reading the JSON stream: your JSON file is really known to have a very strict structure so the high-level parser can be implemented much simpler.

Suppose a simple test suite with different implementations for your problem:

Data objects

City.java

final class City {

    @SerializedName("_id")
    final int id;

    @SerializedName("country")
    final String country;

    @SerializedName("name")
    final String name;

    @SerializedName("coord")
    final Coordinates coordinates;

    private City(final int id, final String country, final String name, final Coordinates coordinates) {
        this.id = id;
        this.country = country;
        this.name = name;
        this.coordinates = coordinates;
    }

    static City of(final int id, final String country, final String name, final Coordinates coordinates) {
        return new City(id, country, name, coordinates);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final City that = (City) o;
        return id == that.id;
    }

    @Override
    public int hashCode() {
        return id;
    }

    @SuppressWarnings("ConstantConditions")
    public static int compareByName(final City city1, final City city2) {
        return city1.name.compareTo(city2.name);
    }

}

Coordinates.java

final class Coordinates {

    @SerializedName("lat")
    final double latitude;

    @SerializedName("lon")
    final double longitude;

    private Coordinates(final double latitude, final double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    static Coordinates of(final double latitude, final double longitude) {
        return new Coordinates(latitude, longitude);
    }

    @Override
    public boolean equals(final Object o) {
        if ( this == o ) {
            return true;
        }
        if ( o == null || getClass() != o.getClass() ) {
            return false;
        }
        final Coordinates that = (Coordinates) o;
        return Double.compare(that.latitude, latitude) == 0
                && Double.compare(that.longitude, longitude) == 0;
    }

    @Override
    public int hashCode() {
        final long latitudeBits = Double.doubleToLongBits(latitude);
        final long longitudeBits = Double.doubleToLongBits(longitude);
        final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
        final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
        return 31 * latitudeHash + longitudeHash;
    }

}

Test infrastructure

ITest.java

interface ITest {

    @Nonnull
    default String getName() {
        return getClass().getSimpleName();
    }

    @Nonnull
    Collection<City> test(@Nonnull JsonReader jsonReader)
            throws IOException;

}

main

    public static void main(final String... args)
            throws IOException {
        final Iterable<ITest> tests = ImmutableList.of(
                FirstTest.get(),
                ReadAsWholeListTest.get(),
                ReadAsWholeTreeSetTest.get(),
                ReadJsonStreamIntoListTest.get(),
                ReadJsonStreamIntoTreeSetTest.get(),
                ReadJsonStreamIntoListChunksTest.get()
        );
        for ( int i = 0; i < 3; i++ ) {
            for ( final ITest test : tests ) {
                try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
                    for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
                        if ( zipEntry.getName().equals("cities.json") ) {
                            final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
                            System.out.printf("%1$35s : ", test.getName());
                            final Stopwatch stopwatch = Stopwatch.createStarted();
                            final Collection<City> cities = test.test(jsonReader);
                            System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
                            assertSorted(cities, City::compareByName);
                        }
                    }
                }
            }
            System.out.println("--------------------");
        }
    }

    private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
        final Iterator<? extends E> iterator = iterable.iterator();
        if ( !iterator.hasNext() ) {
            return;
        }
        E a = iterator.next();
        if ( !iterator.hasNext() ) {
            return;
        }
        do {
            final E b = iterator.next();
            if ( comparator.compare(a, b) > 0 ) {
                throw new AssertionError(a + " " + b);
            }
            a = b;
        } while ( iterator.hasNext() );
    }

Tests

FirstTest.java

This is the slowest one. And it's just an adaptation of your question to the tests.

final class FirstTest
        implements ITest {

    private static final ITest instance = new FirstTest();

    private FirstTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        jsonReader.beginArray();
        final List<City> cities = new ArrayList<>();
        while ( jsonReader.hasNext() ) {
            final City city = new Gson().fromJson(jsonReader, City.class);
            cities.add(city);
        }
        jsonReader.endArray();
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadAsWholeListTest.java

This is most likely how you might implement it. It's not the winner, but it's the simplest one, and it uses default sorting.

final class ReadAsWholeListTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeListTest();

    private ReadAsWholeListTest() {
    }

    static ITest get() {
        return instance;
    }

    private static final Gson gson = new Gson();

    private static final Type citiesListType = new TypeToken<List<City>>() {
    }.getType();

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader) {
        final List<City> cities = gson.fromJson(jsonReader, citiesListType);
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadAsWholeTreeSetTest.java

Another idea, if you're not bound to lists, is using an already-sorted collections like TreeSet. Since I don't know if there's a way to specify a new TreeSet comparator mechanism in Gson, it must use a custom type adapter factory (but this is not required if City is already comparable by name, however it's not flexible).

final class ReadAsWholeTreeSetTest
        implements ITest {

    private static final ITest instance = new ReadAsWholeTreeSetTest();

    private ReadAsWholeTreeSetTest() {
    }

    static ITest get() {
        return instance;
    }

    @SuppressWarnings({ "rawtypes", "unchecked" })
    private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);

    private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
            City.class, (Comparator<City>) City::compareByName
    );

    private static final Gson gson = new GsonBuilder()
            .registerTypeAdapterFactory(new TypeAdapterFactory() {
                @Override
                public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
                    if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
                        return null;
                    }
                    final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
                    @SuppressWarnings({ "rawtypes", "unchecked" })
                    final Comparator<Object> comparator = (Comparator) comparatorsRegistry.get(elementType);
                    if ( comparator == null ) {
                        return null;
                    }
                    final TypeAdapter<TreeSet<?>> originalTreeSetTypeAdapter = gson.getDelegateAdapter(this, rawTreeSetType);
                    final TypeAdapter<?> originalElementTypeAdapter = gson.getDelegateAdapter(this, TypeToken.get(elementType));
                    final TypeAdapter<TreeSet<Object>> treeSetTypeAdapter = new TypeAdapter<TreeSet<Object>>() {
                        @Override
                        public void write(final JsonWriter jsonWriter, final TreeSet<Object> treeSet)
                                throws IOException {
                            originalTreeSetTypeAdapter.write(jsonWriter, treeSet);
                        }

                        @Override
                        public TreeSet<Object> read(final JsonReader jsonReader)
                                throws IOException {
                            jsonReader.beginArray();
                            final TreeSet<Object> elements = new TreeSet<>(comparator);
                            while ( jsonReader.hasNext() ) {
                                final Object element = originalElementTypeAdapter.read(jsonReader);
                                elements.add(element);
                            }
                            return elements;
                        }
                    }.nullSafe();
                    @SuppressWarnings({ "rawtypes", "unchecked" })
                    final TypeAdapter<T> castTreeSetTypeAdapter = (TypeAdapter<T>) treeSetTypeAdapter;
                    return castTreeSetTypeAdapter;
                }
            })
            .create();

    private static final Type citiesSetType = new TypeToken<TreeSet<City>>() {
    }.getType();

    @Nonnull
    @Override
    public Set<City> test(@Nonnull final JsonReader jsonReader) {
        return gson.fromJson(jsonReader, citiesSetType);
    }

}

JSON stream reader tests

The following class is a special reader test that uses a simplified strategy of reading the cities JSON.

AbstractJsonStreamTest.java

It's probably as simplest as possible (in terms of JSON structure analysis), and it requires the JSON document to be very strict.

abstract class AbstractJsonStreamTest
        implements ITest {

    protected static void read(final JsonReader jsonReader, final Consumer<? super City> cityConsumer)
            throws IOException {
        jsonReader.beginArray();
        while ( jsonReader.hasNext() ) {
            jsonReader.beginObject();
            require(jsonReader, "country");
            final String country = jsonReader.nextString();
            require(jsonReader, "name");
            final String name = jsonReader.nextString();
            require(jsonReader, "_id");
            final int id = jsonReader.nextInt();
            require(jsonReader, "coord");
            jsonReader.beginObject();
            require(jsonReader, "lon");
            final double longitude = jsonReader.nextDouble();
            require(jsonReader, "lat");
            final double latitude = jsonReader.nextDouble();
            jsonReader.endObject();
            jsonReader.endObject();
            final City city = City.of(id, country, name, Coordinates.of(latitude, longitude));
            cityConsumer.accept(city);
        }
        jsonReader.endArray();
    }

    private static void require(final JsonReader jsonReader, final String expectedName)
            throws IOException {
        final String actualName = jsonReader.nextName();
        if ( !actualName.equals(expectedName) ) {
            throw new JsonParseException("Expected " + expectedName + " but was " + actualName);
        }
    }

}

ReadJsonStreamIntoListTest.java

This one is pretty much like ReadAsWholeListTest but it uses a simplified deserialization mechanism.

final class ReadJsonStreamIntoListTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoListTest();

    private ReadJsonStreamIntoListTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public Collection<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final List<City> cities = new ArrayList<>();
        read(jsonReader, cities::add);
        cities.sort(City::compareByName);
        return cities;
    }

}

ReadJsonStreamIntoTreeSetTest.java

This one, like the previous one, is also just another implementation of a more expensive implementation (ReadAsWholeTreeSetTest), however it does not require a custom type adatpter.

final class ReadJsonStreamIntoTreeSetTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoTreeSetTest();

    private ReadJsonStreamIntoTreeSetTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public Collection<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final Collection<City> cities = new TreeSet<>(City::compareByName);
        read(jsonReader, cities::add);
        return cities;
    }

}

ReadJsonStreamIntoListChunksTest.java

The following test is based on your initial idea, but it does not sort the chunks in parallel (I'm not sure but you could give it a try). I still think that the previous two are simpler and probably easier to maintain and give more performance gain.

final class ReadJsonStreamIntoListChunksTest
        extends AbstractJsonStreamTest {

    private static final ITest instance = new ReadJsonStreamIntoListChunksTest();

    private ReadJsonStreamIntoListChunksTest() {
    }

    static ITest get() {
        return instance;
    }

    @Nonnull
    @Override
    public List<City> test(@Nonnull final JsonReader jsonReader)
            throws IOException {
        final Collection<List<City>> cityChunks = new ArrayList<>();
        final AtomicReference<List<City>> cityChunkRef = new AtomicReference<>(new ArrayList<>());
        read(jsonReader, city -> {
            final List<City> cityChunk = cityChunkRef.get();
            cityChunk.add(city);
            if ( cityChunk.size() >= 10000 ) {
                cityChunks.add(cityChunk);
                cityChunkRef.set(new ArrayList<>());
            }
        });
        if ( !cityChunkRef.get().isEmpty() ) {
            cityChunks.add(cityChunkRef.get());
        }
        for ( final List<City> cities : cityChunks ) {
            Collections.sort(cities, City::compareByName);
        }
        return merge(cityChunks, City::compareByName);
    }

    /**
     * <p>Adapted from:</p>
     * <ul>
     * <li>Original question: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list</li>
     * <li>Accepted answer: https://stackoverflow.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list/1775748#1775748</li>
     * </ul>
     */
    @SuppressWarnings("MethodCallInLoopCondition")
    private static <E> List<E> merge(final Iterable<? extends List<E>> lists, final Comparator<? super E> comparator) {
        int totalSize = 0;
        for ( final List<E> l : lists ) {
            totalSize += l.size();
        }
        final List<E> result = new ArrayList<>(totalSize);
        while ( result.size() < totalSize ) { // while we still have something to add
            List<E> lowest = null;
            for ( final List<E> l : lists ) {
                if ( !l.isEmpty() ) {
                    if ( lowest == null || comparator.compare(l.get(0), lowest.get(0)) <= 0 ) {
                        lowest = l;
                    }
                }
            }
            assert lowest != null;
            result.add(lowest.get(0));
            lowest.remove(0);
        }
        return result;
    }

}

Test results

For my desktop JRE I could obtain the following test results:

                          FirstTest : in 5797 ms with 209557 elements
                ReadAsWholeListTest : in 796 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 733 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 461 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 452 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 607 ms with 209557 elements
--------------------
                          FirstTest : in 3396 ms with 209557 elements
                ReadAsWholeListTest : in 493 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 520 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 385 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 377 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 540 ms with 209557 elements
--------------------
                          FirstTest : in 3448 ms with 209557 elements
                ReadAsWholeListTest : in 429 ms with 209557 elements
             ReadAsWholeTreeSetTest : in 421 ms with 162006 elements
         ReadJsonStreamIntoListTest : in 400 ms with 209557 elements
      ReadJsonStreamIntoTreeSetTest : in 385 ms with 162006 elements
   ReadJsonStreamIntoListChunksTest : in 480 ms with 209557 elements
--------------------

As you can see, creating excessive Gson instance is definitely a wrong idea. The more optimized tests gain better performance. However, splitting a large list into sorted chunks (no parallel) to be merged later does not give much performance gain in my environment.

For simplicity and probably the best choice, I would go with ReadJsonStreamInto_Collection_Test depending on the required collection. I'm not really sure how fine would it work in a real Android environment, but you can simply do some JSON deserialization a bit better than Gson does using its internals.

By the way:

  • I'm not really sure, but did you notice 162006 unique cities? Your JSON file probably has some duplicates (at least if its _id is the identity).
  • What if you simply generate a sorted version of cities.json in advance on your workstation before you use it on an Android device? Additionally, you might want to filter out the duplicates if my above assumption is correct.

Upvotes: 4

Related Questions