Reputation: 61
Through a series of functions raking HTML, finding the text, and then finding keywords and their scores, I end up with a tuple that looks like this:
test_new = extract_keywords(test_test)
('keywords: ',
[('single high-level impulse noise', 23.5),
('cable replacement programme failed', 16.0),
('meet current british standards', 16.0),
('engineer michael jones', 8.333333333333334),
('18 months engineers began', 8.25),
('embarrassed householder promised', 8.0),
('second-hand television', 8.0),
('openreach chief engineer', 7.75),
('electrical interference emitted', 7.583333333333334),
('entire village lost', 7.0),
('stable broadband signal', 6.714285714285714),
('problem television fixed', 6.6),
('electrical noise', 5.75),
('electrical interference', 4.583333333333334),
('mr jones', 4.333333333333334),
('engineers discovered', 4.25))
I thought I could use Counter to find the n largest values, but that doesn't seem to work on tuples. I tried slicing it with test_new[:3] to get the top values since it's already ordered, but that didn't work either.
Ideally I need to pass it through a function:
def top_keywords(rake_keywords, n=3):
    # get top n keywords
    return
where I can return the values based on the n value. I attempted:
sorted(test_new, key=lambda t: t[1], reverse=True)[:5]
but got
'<' not supported between instances of 'str' and 'tuple'
Upvotes: 1
Views: 106
Reputation: 584
If your extract_keywords function returned ('keywords: ', [<your actual dataset>]), where the actual dataset is inside the tuple, then it's simply a matter of indexing the dataset with test_new[1] and passing that to your sorted call instead of the entire tuple:
sorted(test_new[1], key=lambda t: t[1], reverse=True)[:5]
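To see why the original call fails, here is a minimal sketch using a shortened copy of the data from the question (the names error and top_two are only for illustration). Sorting the outer tuple applies the key to the label string and to the list, so it ends up comparing a character with a tuple:

```python
# Shortened copy of the question's data: a 2-tuple of (label, list of pairs).
test_new = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('electrical noise', 5.75),
    ('meet current british standards', 16.0),
])

# Sorting the outer tuple: t[1] is 'e' for the label string and a
# (keyword, score) tuple for the list, and those two cannot be compared.
try:
    sorted(test_new, key=lambda t: t[1], reverse=True)
    error = None
except TypeError as exc:
    error = str(exc)

# Sorting the inner list compares the float scores, as intended.
top_two = sorted(test_new[1], key=lambda t: t[1], reverse=True)[:2]

print(error)
print(top_two)
```

Because sorted is stable, even with reverse=True, items with equal scores keep their original relative order.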
However, I think this problem stems from your extract_keywords function. If I were to guess, the function has this as its return statement:
return 'keywords: ', keywords
If this is the case, it obscures your real data: the function now returns a tuple containing the string "keywords: " followed by the actual keywords, so you have to index the tuple to get at the data. You don't need the return statement to say that it is returning keywords; the function name and return keywords already document that. Replace the line with return keywords and you can run sorted as you originally did, without needing to write test_new[1].
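As a rough sketch of that change: the body of the real extract_keywords is not shown in the question, so the function below is a hypothetical stand-in that only illustrates returning the list with no label in front of it.

```python
def extract_keywords(scored_phrases):
    """Hypothetical stand-in; the asker's real function body is not shown."""
    keywords = sorted(scored_phrases, key=lambda pair: pair[1], reverse=True)
    # Return the list itself instead of ('keywords: ', keywords).
    return keywords


test_new = extract_keywords([
    ('electrical noise', 5.75),
    ('single high-level impulse noise', 23.5),
    ('18 months engineers began', 8.25),
])

# The result is a plain list, so slicing works directly.
top_two = test_new[:2]
print(top_two)
```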
If you would like help turning the sorted statement into a function, the other answers cover that.
Coming from your original question, I originally assumed the problem to be with the dataset itself. With your clarification on what the data looks like, it appears this isn't the case.
Upvotes: 1
Reputation: 8508
If you want a function that returns the top n items from the tuple, you can use the function below:
def top_n_tups(tups, n=3):
    sorted_tup = sorted(tups, key=lambda tup: tup[1], reverse=True)
    return sorted_tup[:n]
top_n_tups(test_new[1])
This returns the result shown below, assuming test_new is a tuple with a list of tuples inside it.
[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0)]
You can also call the function with a value of n. If there is no n, it will default to top 3. If you give n=6, then top 6. Example below shows that.
>>> top_n_tups(test_new[1],6)
[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0), ('engineer michael jones', 8.333333333333334), ('18 months engineers began', 8.25), ('embarrassed householder promised', 8.0)]
If you are storing the tuple into a variable like this, then you can use index to retrieve them.
test_new = ('keywords: ',
[('single high-level impulse noise', 23.5),
('cable replacement programme failed', 16.0),
('meet current british standards', 16.0),
('engineer michael jones', 8.333333333333334),
('18 months engineers began', 8.25),
('embarrassed householder promised', 8.0),
('second-hand television', 8.0),
('openreach chief engineer', 7.75),
('electrical interference emitted', 7.583333333333334),
('entire village lost', 7.0),
('stable broadband signal', 6.714285714285714),
('problem television fixed', 6.6),
('electrical noise', 5.75),
('electrical interference', 4.583333333333334),
('mr jones', 4.333333333333334),
('engineers discovered', 4.25)])
then you can use something like this:
>>> test_new[1][:3]
[('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0), ('meet current british standards', 16.0)]
you can also get to the specific value like this:
>>> test_new[1][0][0]
'single high-level impulse noise'
>>> test_new[1][0][1]
23.5
However, if the data does not contain a list and holds the tuples directly, like this, then you can retrieve them more easily.
>>> test_new = ('keywords: ',
('single high-level impulse noise', 23.5),
('cable replacement programme failed', 16.0),
('meet current british standards', 16.0),
('engineer michael jones', 8.333333333333334),
('18 months engineers began', 8.25),
('embarrassed householder promised', 8.0),
('second-hand television', 8.0),
('openreach chief engineer', 7.75),
('electrical interference emitted', 7.583333333333334),
('entire village lost', 7.0),
('stable broadband signal', 6.714285714285714),
('problem television fixed', 6.6),
('electrical noise', 5.75),
('electrical interference', 4.583333333333334),
('mr jones', 4.333333333333334),
('engineers discovered', 4.25))
Then you can retrieve it as follows:
>>> test_new[1]
('single high-level impulse noise', 23.5)
>>> test_new[:3]
('keywords: ', ('single high-level impulse noise', 23.5), ('cable replacement programme failed', 16.0))
Note that test_new[0] is 'keywords: '.
Upvotes: 1
Reputation: 476
Your sample data was missing a closing ] on the list, but it looks like you were on the right track with your first try at slicing:
test_new[1][:3]
Gives you the top 3 tuples, then you just need to extract the keywords from that:
top_keywords = [kw[0] for kw in test_new[1][:3]]
Or to break it down into a function:
def top_keywords(rake_keywords, n=3):
    keyword_list = rake_keywords[1]
    top_keyword_items = keyword_list[:n]
    top_keywords = [kw[0] for kw in top_keyword_items]
    return top_keywords
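A short usage sketch of such a function, assuming (as in the question) that the list is already ordered by score, since slicing alone does no sorting. The sample tuple is a shortened copy of the question's data:

```python
def top_keywords(rake_keywords, n=3):
    # rake_keywords is ('keywords: ', [(keyword, score), ...])
    keyword_list = rake_keywords[1]
    # Keep only the keyword text from the first n (keyword, score) pairs.
    return [kw[0] for kw in keyword_list[:n]]


sample = ('keywords: ', [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('meet current british standards', 16.0),
    ('engineer michael jones', 8.333333333333334),
])

default_top = top_keywords(sample)
top_one = top_keywords(sample, n=1)
print(default_top)
print(top_one)
```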
Upvotes: 1
Reputation: 360
If you store the value of test_new like this:
test_new = ('keywords: ', [
('single high-level impulse noise', 23.5),
('cable replacement programme failed', 16.0),
('meet current british standards', 16.0),
('engineer michael jones', 8.333333333333334),
('18 months engineers began', 8.25),
('embarrassed householder promised', 8.0),
('second-hand television', 8.0),
('openreach chief engineer', 7.75),
('electrical interference emitted', 7.583333333333334),
('entire village lost', 7.0),
('stable broadband signal', 6.714285714285714),
('problem television fixed', 6.6),
('electrical noise', 5.75),
('electrical interference', 4.583333333333334),
('mr jones', 4.333333333333334),
('engineers discovered', 4.25)
])
then you can do:
def top_keywords(rake_keywords, n=3):
    return sorted(rake_keywords[1], key=lambda t: t[1], reverse=True)[:n]
Upvotes: 1
Reputation: 5521
I thought I could use Counter to find the n largest values but that doesn't seem to work on tuples.
It does work on dict, which does work on tuples:
from collections import Counter
Counter(dict(test_new[1])).most_common(3)
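A self-contained sketch of the Counter approach on a shortened copy of the data, with heapq.nlargest shown as an alternative that skips the dict conversion. The heapq variant is my own addition, not part of the question. Note that dict() keeps only one score per keyword, so duplicate keywords would collapse:

```python
from collections import Counter
import heapq

# Shortened copy of the (keyword, score) list from the question.
scored = [
    ('single high-level impulse noise', 23.5),
    ('cable replacement programme failed', 16.0),
    ('18 months engineers began', 8.25),
    ('electrical noise', 5.75),
]

# Counter accepts a mapping of keyword -> score; most_common(n) returns
# the n pairs with the largest values, highest first.
top_counter = Counter(dict(scored)).most_common(2)

# heapq.nlargest does the same directly on the list of pairs.
top_heap = heapq.nlargest(2, scored, key=lambda pair: pair[1])

print(top_counter)
print(top_heap)
```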
Upvotes: 1