blkngoldbudda

Reputation: 61

Getting top n results from a tuple

Through a series of functions that rake the HTML, extract the text, and then find keywords and their scores, I end up with a tuple that looks like this:

test_new = extract_keywords(test_test)

('keywords: ',
 [('single high-level impulse noise', 23.5),
  ('cable replacement programme failed', 16.0),
  ('meet current british standards', 16.0),
  ('engineer michael jones', 8.333333333333334),
  ('18 months engineers began', 8.25),
  ('embarrassed householder promised', 8.0),
  ('second-hand television', 8.0),
  ('openreach chief engineer', 7.75),
  ('electrical interference emitted', 7.583333333333334),
  ('entire village lost', 7.0),
  ('stable broadband signal', 6.714285714285714),
  ('problem television fixed', 6.6),
  ('electrical noise', 5.75),
  ('electrical interference', 4.583333333333334),
  ('mr jones', 4.333333333333334),
  ('engineers discovered', 4.25))

I thought I could use Counter to find the n largest values but that doesn't seem to work on tuples. I tried slicing it with test_new[:3] to get the top values, since it's already ordered, but that didn't work either.

Ideally I need to pass it through a function:

def top_keywords(rake_keywords, n=3):
    # get top n keywords
    return

where I can return the values based on the value of n. I attempted:

sorted(test_new, key=lambda t: t[1], reverse=True)[:5]

but got

'<' not supported between instances of 'str' and 'tuple'

Upvotes: 1

Views: 106

Answers (5)

thegamecracks

Reputation: 584

If your extract_keywords function returns ('keywords: ', [<your actual dataset>]), so that the actual dataset sits inside the tuple, then it's simply a matter of indexing the dataset with test_new[1] and passing that to your sorted call instead of the entire tuple:

sorted(test_new[1], key=lambda t: t[1], reverse=True)[:5]
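
To see why the original call fails while the indexed one works, here is a minimal sketch. The three-item list is a shortened copy of the data from the question, used only for illustration:

```python
# Shortened copy of the structure from the question:
# a label followed by a list of (phrase, score) pairs.
test_new = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('engineers discovered', 4.25),
])

# Sorting the outer tuple applies t[1] to the label (giving a str)
# and to the list (giving a tuple), which cannot be compared.
try:
    sorted(test_new, key=lambda t: t[1], reverse=True)
    outer_sort_failed = False
except TypeError:
    outer_sort_failed = True

# Sorting the inner list compares the float scores, as intended.
top_two = sorted(test_new[1], key=lambda t: t[1], reverse=True)[:2]
print(outer_sort_failed, top_two)
```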

However, I think this problem stems from your extract_keywords function. If I were to guess, it has this as its return statement:

return 'keywords: ', keywords

If this is the case, it obscures your real data, because the function now returns a tuple containing the string "keywords: " followed by the actual keywords, and you have to index the tuple to obtain the data. You don't need the return statement to say that it's "keywords" being given; the function name and return keywords already document that. Replace the line with return keywords and you can run sorted as you originally wrote it, without needing test_new[1].

If you would like help in turning the sorted statement into a function, the other answers have that.

Coming from your original question, I originally assumed the problem to be with the dataset itself. With your clarification on what the data looks like, it appears this isn't the case.

Upvotes: 1

Joe Ferndz

Reputation: 8508

function to get top n items from a tuple

If you want to create a function that gets you the top n items from the tuple, then you can use the function below:

def top_n_tups(tups, n=3):
    sorted_tup = sorted(tups, key=lambda tup: tup[1], reverse=True)
    return sorted_tup[:n]

top_n_tups(test_new[1])

This returns the result shown below. The assumption is that test_new is a tuple with a list of tuples inside it.

[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0)]

You can also pass a value for n. If n is omitted, it defaults to the top 3; with n=6 you get the top 6, as in this example:

>>> top_n_tups(test_new[1],6)

[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0), ('engineer michael jones', 8.333333333333334), ('18 months engineers began', 8.25), ('embarrassed householder promised', 8.0)]

tuple contains a list of tuples

If you are storing the tuple in a variable like this, then you can use indexing to retrieve the items.

test_new = ('keywords: ',
 [('single high-level impulse noise', 23.5),
  ('cable replacement programme failed', 16.0),
  ('meet current british standards', 16.0),
  ('engineer michael jones', 8.333333333333334),
  ('18 months engineers began', 8.25),
  ('embarrassed householder promised', 8.0),
  ('second-hand television', 8.0),
  ('openreach chief engineer', 7.75),
  ('electrical interference emitted', 7.583333333333334),
  ('entire village lost', 7.0),
  ('stable broadband signal', 6.714285714285714),
  ('problem television fixed', 6.6),
  ('electrical noise', 5.75),
  ('electrical interference', 4.583333333333334),
  ('mr jones', 4.333333333333334),
  ('engineers discovered', 4.25)])

then you can use something like this:

>>> test_new[1][:3]
[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0)]

you can also get to the specific value like this:

>>> test_new[1][0][0]
'single high-level impulse noise'

>>> test_new[1][0][1]
23.5

contains only tuples

However, if the data does not have a list and only contains tuples like this, then you can retrieve it more easily.

>>> test_new = ('keywords: ',
  ('single high-level impulse noise', 23.5),
  ('cable replacement programme failed', 16.0),
  ('meet current british standards', 16.0),
  ('engineer michael jones', 8.333333333333334),
  ('18 months engineers began', 8.25),
  ('embarrassed householder promised', 8.0),
  ('second-hand television', 8.0),
  ('openreach chief engineer', 7.75),
  ('electrical interference emitted', 7.583333333333334),
  ('entire village lost', 7.0),
  ('stable broadband signal', 6.714285714285714),
  ('problem television fixed', 6.6),
  ('electrical noise', 5.75),
  ('electrical interference', 4.583333333333334),
  ('mr jones', 4.333333333333334),
  ('engineers discovered', 4.25))

Then you can retrieve it as follows:

>>> test_new[1]
('single high-level impulse noise', 23.5)

>>> test_new[:3]
('keywords: ', ('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0))

Note that test_new[0] is 'keywords: '
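
Because index 0 holds the label in this flat layout, a slice meant to return the top three pairs has to start at 1. A small sketch, using a shortened copy of the flat tuple above:

```python
# Flat layout: the label sits at index 0, the (phrase, score) pairs follow.
test_new = ('keywords: ',
  ('single high-level impulse noise', 23.5),
  ('cable replacement programme failed', 16.0),
  ('meet current british standards', 16.0),
  ('engineer michael jones', 8.333333333333334))

# Skip the label and take the next three items.
top_three = test_new[1:4]
print(top_three)
```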

Upvotes: 1

John S

Reputation: 476

Your sample data was missing a closing ] on the list, but it looks like you were on the right track with your first try at slicing:

test_new[1][:3]

This gives you the top 3 tuples; then you just need to extract the keywords from them:

top_keywords = [kw[0] for kw in test_new[1][:3]]

Or to break it down into a function:

def top_keywords(rake_keywords, n=3):
    keyword_list = rake_keywords[1]
    top_keyword_items = keyword_list[:n]
    top_keywords = [kw[0] for kw in top_keyword_items]
    return top_keywords
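
A quick usage check for a function like the one above, repeated in condensed form so the snippet runs on its own. The data is a shortened copy of the question's output, and the slice relies on the list already being ordered by score:

```python
def top_keywords(rake_keywords, n=3):
    # rake_keywords is assumed to be (label, [(phrase, score), ...]),
    # with the list already sorted from highest to lowest score.
    keyword_list = rake_keywords[1]
    return [kw[0] for kw in keyword_list[:n]]

test_new = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('meet current british standards', 16.0),
    ('engineer michael jones', 8.333333333333334),
])

default_result = top_keywords(test_new)
two_result = top_keywords(test_new, n=2)
print(default_result)
print(two_result)
```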

Upvotes: 1

If you store the value of test_new like this:

test_new = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('meet current british standards', 16.0),
    ('engineer michael jones', 8.333333333333334),
    ('18 months engineers began', 8.25),
    ('embarrassed householder promised', 8.0),
    ('second-hand television', 8.0),
    ('openreach chief engineer', 7.75),
    ('electrical interference emitted', 7.583333333333334),
    ('entire village lost', 7.0),
    ('stable broadband signal', 6.714285714285714),
    ('problem television fixed', 6.6),
    ('electrical noise', 5.75),
    ('electrical interference', 4.583333333333334),
    ('mr jones', 4.333333333333334),
    ('engineers discovered', 4.25)
])

then you can do:

def top_keywords(rake_keywords, n=3):
    return sorted(rake_keywords[1], key=lambda t: t[1], reverse=True)[:n]
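
As an alternative sketch (a different technique from the sorted call above), heapq.nlargest from the standard library picks the n highest-scoring items without sorting the whole list first. The function name top_keywords_heap and the shortened, deliberately unordered data are made up for illustration:

```python
import heapq

def top_keywords_heap(rake_keywords, n=3):
    # rake_keywords is assumed to be (label, [(phrase, score), ...]).
    # nlargest returns the n items with the highest score, highest first.
    return heapq.nlargest(n, rake_keywords[1], key=lambda t: t[1])

# Hypothetical unordered input to show that no prior ordering is needed.
unordered = ('keywords: ', [
    ('engineers discovered', 4.25),
    ('single high-level impulse noise', 23.5),
    ('18 months engineers began', 8.25),
    ('cable replacement programme failed', 16.0),
])

top_two = top_keywords_heap(unordered, n=2)
print(top_two)
```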

Upvotes: 1

superb rain

Reputation: 5521

I thought I could use Counter to find the n largest values but that doesn't seem to work on tuples.

Counter does work on a dict, and dict does work on a list of (key, value) tuples such as test_new[1]:

from collections import Counter
Counter(dict(test_new[1])).most_common(3)
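
A self-contained sketch of that one-liner, using a shortened copy of the question's data. One caveat worth keeping in mind: converting to a dict keeps only one score per phrase, so if the same phrase appeared twice, the later score would replace the earlier one.

```python
from collections import Counter

test_new = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('engineer michael jones', 8.333333333333334),
    ('engineers discovered', 4.25),
])

# dict() turns the (phrase, score) pairs into {phrase: score};
# Counter then treats each score as a count, and most_common
# returns the pairs with the highest counts first.
top_two = Counter(dict(test_new[1])).most_common(2)
print(top_two)
```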

Upvotes: 1
