Reputation: 129

Python list group by

I have a Python list like this:

Category    Title   ProductId   Rating
'Electronics, Books, Bundles'   Lautner e-Reader Cover  161553  4
'Electronics, Books, Bundles'   Lautner stand in e-Reader Cover 161552  3
'Electronics, Books, Bundles'   Lautner Chocolate NOOK Case 594451  5
'Electronics, Books, Bundles'   Oliver e-Reader Cover   161685  1
'Electronics, Books, Covers'    Dessin Leather Cover for Nook Color; Nook Tablet Digital Reader 594033  4.3
'Electronics, Books, Covers'    Emerson Quote e-Reader Cover    161542  2.8
'Electronics, Books, Covers'    Industriell Easel e-Reader Cover    161682  3.7
'Electronics, Books, Covers'    Jonathan Adler Book Reader Cover Hd - Elephant  594548  4.9
'Electronics, Scanners, Covers' Lyra Light Front Cover for NOOK eR  161683  4
'Electronics, Scanners, Covers' Nook Tablet Dessin Cover in Marine  161686  3.8
'Electronics, Scanners, Covers' Nook Tablet Horizontal Stand Cover in Red   594202  4.2
'Electronics, Scanners, Covers' Canvas Bella Library Cover  161554  3
'Electronics, Books, Radios'    Groovy Protective Stand Cover: Custom Designed for 7-inch NOOK HD   594549  3.8
'Electronics, Books, Radios'    Hd Groovy Stand In Blue- Nook   594514  4.1
'Electronics, Books, Radios'    Hutton Envelope in Bark 161560  2.9
'Electronics, Books, Radios'    Italian Leather-Style Chesterton Cover for NOOK Reader  161561  4

From these values, I want the top k entries by rating from each category. For example, top 2 should give the result below:

'Electronics, Books, Bundles'   Lautner Chocolate NOOK Case 594451  5
'Electronics, Books, Bundles'   Lautner e-Reader Cover  161553  4
'Electronics, Books, Covers'    Jonathan Adler Book Reader Cover Hd - Elephant  594548  4.9
'Electronics, Books, Covers'    Dessin Leather Cover for Nook Color; Nook Tablet Digital Reader 594033  4.3
'Electronics, Books, Radios'    Hd Groovy Stand In Blue- Nook   594514  4.1
'Electronics, Books, Radios'    Italian Leather-Style Chesterton Cover for NOOK Reader  161561  4
'Electronics, Scanners, Covers' Nook Tablet Horizontal Stand Cover in Red   594202  4.2
'Electronics, Scanners, Covers' Lyra Light Front Cover for NOOK eR  161683  4

Adding whatever I have tried:

import operator
import sys

sorted_data = sorted(data, key=operator.itemgetter(1), reverse=True)

k = int(sys.argv[1])
for result in sorted_data[:k]:
    print(result)

Here I am passing 'k' as a command-line argument to the Python file.

Upvotes: 2

Answers (4)

Gothami94

Reputation: 11

I assume you are looking for something similar to this. This is the code.

Your list is too long. That is why I used a simple list here. This is the result that I got.

Upvotes: 0

EyuelDK

Reputation: 3199

Using iterators, you can do this fairly efficiently. Note: this uses only the Python standard library.

import heapq
import itertools

# group by 'Category'; groupby only merges adjacent elements,
# so some_list must already be sorted (or at least clustered) by category
groups = itertools.groupby(some_list, key=lambda element: element[0])

# take top two of each group based on 'Rating'
top_two_of_each = (heapq.nlargest(2, values, key=lambda value: value[3])
                   for _, values in groups)

# flatten the nested iterators
top_two_of_each_flattened = itertools.chain.from_iterable(top_two_of_each)

# convert iterator into a list
top_two_of_each_flattened_as_list = list(top_two_of_each_flattened)
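
For reference, a minimal end-to-end sketch of this approach. It assumes each element is a (category, title, product_id, rating) tuple; the sample rows and the helper name top_k_per_category are illustrative and not part of the question. Because itertools.groupby only merges adjacent elements with equal keys, the sketch sorts by category first.

```python
import heapq
import itertools
from operator import itemgetter

# Hypothetical sample rows: (category, title, product_id, rating)
some_list = [
    ("Electronics, Books, Bundles", "Lautner e-Reader Cover", 161553, 4),
    ("Electronics, Books, Covers", "Emerson Quote e-Reader Cover", 161542, 2.8),
    ("Electronics, Books, Bundles", "Lautner Chocolate NOOK Case", 594451, 5),
    ("Electronics, Books, Covers", "Jonathan Adler Book Reader Cover Hd - Elephant", 594548, 4.9),
    ("Electronics, Books, Bundles", "Oliver e-Reader Cover", 161685, 1),
    ("Electronics, Books, Covers", "Dessin Leather Cover for Nook Color; Nook Tablet Digital Reader", 594033, 4.3),
]


def top_k_per_category(rows, k):
    """Return the k highest-rated rows of every category (hypothetical helper)."""
    # groupby only merges adjacent rows, so sort by category first
    rows = sorted(rows, key=itemgetter(0))
    groups = itertools.groupby(rows, key=itemgetter(0))
    # nlargest returns each group's rows in descending rating order
    best = (heapq.nlargest(k, group, key=itemgetter(3)) for _, group in groups)
    return list(itertools.chain.from_iterable(best))


for row in top_k_per_category(some_list, 2):
    print(row)
```

If the input is already clustered by category, as in the question, the sort can be dropped; heapq.nlargest then avoids fully sorting each group.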

Upvotes: 4

DerWeh

Reputation: 1829

Probably not the most efficient solution, but an understandable one:

You want the top results per category, so first we need to identify the category of each line. We do this by splitting at ', as this is the easiest indicator; the empty string before the first ' is discarded ([1:]).

separated = [element.split("'")[1:] for element in data]
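
To make the split concrete, here is a small sketch of what it yields for a single row. It assumes each element of data is a plain string in the format shown in the question and that titles contain no apostrophes; the variable names are illustrative.

```python
# One row from the question, as a plain string (assumed input format)
line = "'Electronics, Books, Bundles'   Lautner e-Reader Cover  161553  4"

parts = line.split("'")
# parts[0] is the empty string in front of the opening quote
category, rest = parts[1:]

# The rating is the last whitespace-separated token of the remainder
rating = float(rest.split()[-1])

print(repr(category))
print(repr(rest))
print(rating)
```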

As we are interested in items identified by the first string, a dictionary seems like a suitable data structure.

from collections import defaultdict
data_dict = defaultdict(list)
for line in separated:
    data_dict[line[0]].append(line)

Now we have a nice format and can sort the lists in the dictionary.

for key in data_dict:  # each entry is [category, rest]; the rating is the last token of rest
    data_dict[key].sort(key=lambda line: float(line[-1].split()[-1]), reverse=True)

From this dictionary it is easy to reproduce our results:

k = 2
results = []
for key in data_dict.keys():
    results.extend(data_dict[key][:k])

The key is to use a suitable data structure, here a dictionary. Here is the short solution:

# make a dict 
from collections import defaultdict
data_dict = defaultdict(list)
for line in data:
    data_dict[line.split("'")[1]].append(line)


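# function working on the dict:
def top_results(data_dict, k):
    results = []
    for key in data_dict:
        # sort each category by rating (the last token), highest first
        ranked = sorted(data_dict[key], key=lambda line: float(line.split()[-1]), reverse=True)
        results.extend(ranked[:k])
    return results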

But it is likely more suitable to keep working with a dictionary instead of returning a flat list.

To summarize:

Identify a suitable data structure; here a dict fits.
Obtain your key; split("'") works for this.
Reorganize your data into that format.
Sort your lists using list.sort. A key is needed; here we use just the last word, str.split()[-1], as this is your rating.
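
The steps above can be sketched as one function that returns a dictionary rather than a flat list. This is only an illustration under the assumption that every line has the format from the question; the name top_k_by_category and the sample lines are hypothetical, and the rating is converted to float so that it compares numerically.

```python
from collections import defaultdict


def top_k_by_category(lines, k):
    """Map each category to its k highest-rated lines (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        # The category is the text between the first pair of single quotes
        grouped[line.split("'")[1]].append(line)
    return {
        category: sorted(
            items,
            key=lambda item: float(item.split()[-1]),  # rating is the last token
            reverse=True,
        )[:k]
        for category, items in grouped.items()
    }


data = [
    "'Electronics, Books, Radios'    Hutton Envelope in Bark 161560  2.9",
    "'Electronics, Books, Radios'    Hd Groovy Stand In Blue- Nook   594514  4.1",
    "'Electronics, Books, Radios'    Italian Leather-Style Chesterton Cover for NOOK Reader  161561  4",
    "'Electronics, Books, Bundles'   Oliver e-Reader Cover   161685  1",
]

for category, items in top_k_by_category(data, 2).items():
    # Show the product ids (second-to-last token) of the selected lines
    print(category, [item.split()[-2] for item in items])
```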

Upvotes: 1

zipa

Reputation: 27899

This might be what you need:

data = ''''Electronics, Books, Bundles'   Lautner e-Reader Cover  161553  4
'Electronics, Books, Bundles'   Lautner stand in e-Reader Cover 161552  3
'Electronics, Books, Bundles'   Lautner Chocolate NOOK Case 594451  5
'Electronics, Books, Bundles'   Oliver e-Reader Cover   161685  1
'Electronics, Books, Covers'    Dessin Leather Cover for Nook Color; Nook Tablet Digital Reader 594033  4.3
'Electronics, Books, Covers'    Emerson Quote e-Reader Cover    161542  2.8
'Electronics, Books, Covers'    Industriell Easel e-Reader Cover    161682  3.7
'Electronics, Books, Covers'    Jonathan Adler Book Reader Cover Hd - Elephant  594548  4.9
'Electronics, Scanners, Covers' Lyra Light Front Cover for NOOK eR  161683  4
'Electronics, Scanners, Covers' Nook Tablet Dessin Cover in Marine  161686  3.8
'Electronics, Scanners, Covers' Nook Tablet Horizontal Stand Cover in Red   594202  4.2
'Electronics, Scanners, Covers' Canvas Bella Library Cover  161554  3
'Electronics, Books, Radios'    Groovy Protective Stand Cover: Custom Designed for 7-inch NOOK HD   594549  3.8
'Electronics, Books, Radios'    Hd Groovy Stand In Blue- Nook   594514  4.1
'Electronics, Books, Radios'    Hutton Envelope in Bark 161560  2.9
'Electronics, Books, Radios'    Italian Leather-Style Chesterton Cover for NOOK Reader  161561  4'''


groups = [item.split("' ") for item in data.split('\n')]
grouped_data = {}

for group in groups:
    item = [group[1].strip()]
    group = group[0].strip("'")
    if group not in grouped_data:
        grouped_data[group] = item
    else:
        grouped_data[group] += item

def topN(data, n):
    data = [item.split() for item in data]
    data = sorted(data, key=lambda x: float(x[-1]), reverse=True)[:n]
    data = [' '.join(item) for item in data]
    return data

result = {}
for k, v in grouped_data.items():
    result[k] = topN(v, 2)

final_result = [': '.join([group1, item1]) for group1, value1 in result.items() for item1 in value1]
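
The code above keeps every item as a string and splits it again when sorting. As an alternative sketch (not part of this answer), each line could be parsed once into a record, which makes the rating available as a number. The Product type, the ROW pattern and parse_row are hypothetical names, and the pattern assumes the layout from the question: a quoted category, a title, an integer product id and a numeric rating.

```python
import re
from collections import namedtuple

# Hypothetical record type for one parsed line
Product = namedtuple("Product", "category title product_id rating")

# Layout assumed from the question: 'category'  title  product_id  rating
ROW = re.compile(
    r"^'(?P<category>[^']*)'\s+"
    r"(?P<title>.*?)\s+"
    r"(?P<product_id>\d+)\s+"
    r"(?P<rating>\d+(?:\.\d+)?)\s*$"
)


def parse_row(line):
    """Parse one line into a Product, raising ValueError if it does not fit."""
    match = ROW.match(line)
    if match is None:
        raise ValueError("unrecognised row: %r" % line)
    return Product(
        match.group("category"),
        match.group("title"),
        int(match.group("product_id")),
        float(match.group("rating")),
    )


row = parse_row("'Electronics, Books, Covers'    Emerson Quote e-Reader Cover    161542  2.8")
print(row)
```

With records like these, sorting by rating becomes sorted(rows, key=lambda r: r.rating, reverse=True) without any further string handling.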

Upvotes: 1
