rhzs
rhzs

Reputation: 526

Groovy - Aggregating and Modelling Data from Complex Nested Maps

I have data in groovy presented in the code snippet below:

def productAvailability = [
  [id: 1, startDate: "2014-12-22", endDate: "2015-01-02", storeId: 1, productId: 1, categoryId: 1],
  [id: 4, startDate: "2014-12-24", endDate: "2015-01-08", storeId: 2, productId: 1, categoryId: 1],
  [id: 8, startDate: "2014-12-25", endDate: "2015-01-01", storeId: 2, productId: 3, categoryId: 1],
  [id: 9, startDate: "2014-12-22", endDate: "2015-01-02", storeId: 1, productId: 3, categoryId: 1],
  [id: 10, startDate: "2015-01-10", endDate: "2015-01-21", storeId: 1, productId: 1, categoryId: 1]
];

The objective is to get the result like this:

PRODUCT STATISTICS

Product Id: 1 | Availability Index: 15 + 11 + 11 = 37.
   Longest Available Products (Sort By Past Start Date *first* then, Store Id): 
       1. "2014-12-24" to "2015-01-08" in store id 2. (15 days)
       2. "2014-12-22" to "2015-01-02" in store id 1. (11 days)
       3. "2015-01-10" to "2015-01-21" in store id 1. (11 days)
Product Id: 3 | Availability Index: 7 + 11 = 18.
   Longest Available Products (Sort By Past Start Date *first* then, Store Id): 
       1. "2014-12-22" to "2015-01-02" in store id 1. (11 days)
       2. "2014-12-25" to "2015-01-01" in store id 2. (7 days)

STORE STATISTICS

Store Id: 1 | Availability Index: 11 + 11 + 11 = 33.
   Most Available Product (sort by most available product, then sort by product id):
       1. Product Id: 3 on ["2014-12-22" to "2015-01-02"] (11 days)
       2. Product Id: 1 on ["2014-12-22" to "2015-01-02", "2015-01-10" to "2015-01-21"] (11 days)
Store Id: 2 | Availability Index: 15 + 7 = 22.
   Most Available Product (sort by most available product, then sort by product id):
       1. Product Id: 1 on ["2014-12-24" to "2015-01-08"] (15 days)
       2. Product Id: 3 on ["2014-12-25" to "2015-01-01"] (7 days)

Total Availability Index: 37 + 18 or 33 + 22 = 55.

Here with the printed result above are the product statistics and store statistics. I would seek for the optimized, efficient, and easy to understand solution to print the result above.

My attempt to achieve the results from the above data:

// productAvailability => see the declaration variable above in the beginning of question!
List aggregateDates = productAvailability.collect({[
    storeId: it.storeId,
    productId: it.productId,
    availabilityIndex: Date.parse("YYYY-MM-dd", it.endDate) - Date.parse("YYYY-MM-dd", it.startDate) 
]});
println "Total Availability Index: " + aggregateDates.clone().sum({ it.availabilityIndex });
println "Total Products: " +  aggregateDates.clone().unique({ it.productId }).count({ it.productId });
println "Total Stores: " + aggregateDates.clone().unique({ it.storeId }).count({ it.storeId });
println "Average Availability Index: " + aggregateDates.clone().sum({ it.availabilityIndex }) / aggregateDates.size();

As you can see in the snippet above, I can get very easily the aggregate SUM, AVG, and COUNT how many PRODUCT and STORE are inside the productAvailability data. However, this is difficult for me to get the availability based on PRODUCT and STORE using date ranges to achieve the objective above.

See my code below using date ranges.

def dailyDatesAvailability = [:] as Map<Date, Integer>;
def dailyStoresAvailability = [:].withDefault {0} as Map<Integer, Integer>;
def dailyProductsAvailability  = [:].withDefault {0} as Map<Integer, Integer>;
(Date.parse("YYYY-MM-dd", "2014-12-01")).upto((Date.parse("YYYY-MM-dd", "2015-01-30"))) { Date runningDate ->
        dailyDatesAvailability[runningDate] = 0;
        productAvailability.each({ _availability ->
            def _startDate = Date.parse("YYYY-MM-dd", _availability.startDate);
            def _endDate = Date.parse("YYYY-MM-dd", _availability.endDate);
            if (_startDate <= runningDate && _endDate >= runningDate) {
                dailyDatesAvailability[runningDate]++;
                dailyProductsAvailability[_availability.productId]++;
                dailyStoresAvailability[_availability.storeId]++;
            }

         // Do something here to get the MOST available PRODUCT in a STORE with date ranges
        });
       /// or do something here....?
    }

What is the best way to print the objectives above using Groovy? Please share the code snippet to be able to test.

Upvotes: 1

Views: 133

Answers (1)

tim_yates
tim_yates

Reputation: 171084

Got intrigued by this, and came up with:

List<Range> simplify( List<Range> ranges ) {
  ranges.drop( 1 ).inject( ranges.take( 1 ) ) { r, curr ->
    // Find an overlapping range
    def ov = r.find { curr.from <= it.to && curr.to >= it.from }
    if( ov ) {
      ov.from = [ curr.from, ov.from ].min()
      ov.to   = [ curr.to, ov.to ].max()
      simplify( r )
    }
    else {
      r << curr
    }
  }
}

def manipulate(data, primary, secondary) {
    data.groupBy { it."$primary" }
        .collect { id, vals ->
            def joined = vals.collect { it ->
                [ id: it.id,
                  range: Date.parse('yyyy-MM-dd', it.startDate)..Date.parse('yyyy-MM-dd', it.endDate),
                  key: secondary,
                  value: it."$secondary" ]
            }.groupBy { it.value }
             .collectMany { sid, ran -> simplify(ran.range).collect { [key: secondary, value: sid, range:it, days:(it.to - it.from)] } }
             .sort { a, b -> b.days <=> a.days ?: a.value - b.value }
            [name:primary, id:id, data:joined]
        }
}

def dump(data) {
    data.collect { a ->
        def sum = a.data.days.sum()
        println "$a.name: $a.id | availability index ${a.data.days.join(' + ')} = ${sum}"
        a.data.eachWithIndex { row, idx ->
            println "    ${idx+1}. ${row.range.from.format('yyyy-MM-dd')} to ${row.range.to.format('yyyy-MM-dd')} in $row.key $row.value ($row.days days)"
        }
        sum
    }
}

def productAvailability = [
  [id: 1, startDate: "2014-12-22", endDate: "2015-01-02", storeId: 1, productId: 1, categoryId: 1],
  [id: 4, startDate: "2014-12-24", endDate: "2015-01-08", storeId: 2, productId: 1, categoryId: 1],
  [id: 8, startDate: "2014-12-25", endDate: "2015-01-01", storeId: 2, productId: 3, categoryId: 1],
  [id: 9, startDate: "2014-12-22", endDate: "2015-01-02", storeId: 1, productId: 3, categoryId: 1],
  [id: 10, startDate: "2015-01-10", endDate: "2015-01-21", storeId: 1, productId: 1, categoryId: 1]
];

def p = dump(manipulate(productAvailability, 'productId', 'storeId'))
println ''
def s = dump(manipulate(productAvailability, 'storeId', 'productId'))
println ''
println "Total Availability Index: ${p.join(' + ')} or ${s.join(' + ')} = ${[p.sum(), s.sum()].max()}"

Which prints out:

productId: 1 | availability index 15 + 11 + 11 = 37
    1. 2014-12-24 to 2015-01-08 in storeId 2 (15 days)
    2. 2014-12-22 to 2015-01-02 in storeId 1 (11 days)
    3. 2015-01-10 to 2015-01-21 in storeId 1 (11 days)
productId: 3 | availability index 11 + 7 = 18
    1. 2014-12-22 to 2015-01-02 in storeId 1 (11 days)
    2. 2014-12-25 to 2015-01-01 in storeId 2 (7 days)

storeId: 1 | availability index 11 + 11 + 11 = 33
    1. 2014-12-22 to 2015-01-02 in productId 1 (11 days)
    2. 2015-01-10 to 2015-01-21 in productId 1 (11 days)
    3. 2014-12-22 to 2015-01-02 in productId 3 (11 days)
storeId: 2 | availability index 15 + 7 = 22
    1. 2014-12-24 to 2015-01-08 in productId 1 (15 days)
    2. 2014-12-25 to 2015-01-01 in productId 3 (7 days)

Total Availability Index: 37 + 18 or 33 + 22 = 55

Upvotes: 2

Related Questions