vollkorn

Reputation: 85

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:

import dask.dataframe as dd

df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())

The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:

                  domain_name  domain_length
0                webmagnat.ro             12
1     nickelfreesolutions.com             23
2  scheepvaarttelefoongids.nl             26
3                  tursan.net             10
4       plannersanonymous.com             21

domain_name       object
domain_length    float64
dtype: object

Traceback (most recent call last):
  File "nlargest_test.py", line 9, in <module>
    print(top_3.head())
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
    result = result.compute()
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
    return compute(self, **kwargs)[0]
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
    **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
    raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object

Traceback
---------
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
    f = lambda df: df.nlargest(n, columns)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
    return self._nsorted(columns, n, 'nlargest', keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
    ser = getattr(self[columns[0]], method)(n, keep=keep)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
    return algos.select_n(self, n=n, keep=keep, method='nlargest')
  File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
    raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))

I'm confused because I'm calling nlargest on the domain_length column, which is reported as float64, yet the error says the method cannot be used with dtype object. The same code works fine in pandas. How can I get the n longest entries from a dask DataFrame?

Upvotes: 2

Views: 7414

Answers (5)

guzel6031

Reputation: 11

This is what my original DataFrame looks like.

This is what the new DataFrame looks like after taking the top 5:

station_count.nlargest(5, 'count')

You have to call nlargest on a column with a numeric dtype such as int, not on a string column, so that the values can be ranked. The arguments are always the number of rows n first, followed by the name of the numeric column.
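A minimal pandas sketch of that call. The frames in this answer were shown only as images, so the station_count data below is hypothetical and serves only to illustrate the argument order:

```python
import pandas as pd

# Hypothetical data; the answer's actual frame is not available in text form.
station_count = pd.DataFrame({
    "station": ["A", "B", "C", "D", "E", "F"],
    "count": [12, 45, 7, 30, 22, 3],
})

# n comes first, then the name of the numeric column to rank by.
top_5 = station_count.nlargest(5, "count")
print(top_5)
```

The result keeps all columns of the matching rows, ordered from the largest count downwards.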

Upvotes: 1

Leonardo Mallmann

Reputation: 1

If you want the most frequent values in a string column, you can use value_counts() followed by nlargest(n), where n is the number of values to return.

df['your_column'].value_counts().nlargest(3)

This returns the 3 most frequent values in that column, together with their counts.
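As a small illustration with made-up data (the column name your_column is the placeholder from above, and the values are hypothetical), note that this ranks values by how often they occur, not by their string length:

```python
import pandas as pd

# Hypothetical sample; "a.com" occurs 4 times, "c.com" 3, "b.com" 2, "d.com" 1.
df = pd.DataFrame({
    "your_column": [
        "a.com", "b.com", "a.com", "c.com", "a.com",
        "b.com", "d.com", "c.com", "c.com", "a.com",
    ]
})

# value_counts() yields a Series of counts indexed by value;
# nlargest(3) keeps the three highest counts.
top_counts = df["your_column"].value_counts().nlargest(3)
print(top_counts)
```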

Upvotes: 0

Nouman Tariq

Reputation: 46

You only need to change the type of the respective column to int or float using .astype().

For example, in your case:

top_3 = df['domain_length'].astype(float).nlargest(3)

Upvotes: 0

nnaqa

Reputation: 269

An explicit type conversion helped in my case:

df['column'].astype(str).astype(float).nlargest(5)
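If the column may contain entries that cannot be parsed as numbers, the astype chain will raise. A possible alternative in pandas, sketched here with hypothetical data, is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN before ranking:

```python
import pandas as pd

# Hypothetical object-dtype column holding numbers as strings plus one bad entry.
df = pd.DataFrame({"column": ["12", "23", "26", "n/a", "21"]}, dtype=object)

# Unparseable values become NaN instead of raising an error.
numeric = pd.to_numeric(df["column"], errors="coerce")

# NaN entries are not ranked ahead of real numbers by nlargest.
top = numeric.nlargest(3)
print(top)
```

This is a swapped-in technique, not what the answer above uses; whether silently coercing bad values is acceptable depends on the data.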

Upvotes: 3

MRocklin

Reputation: 57301

I tried to reproduce your problem, but everything worked fine. May I suggest that you produce a Minimal, Complete, and Verifiable Example?

Pandas example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: df['y'] = df.x.map(len)

In [4]: df
Out[4]: 
      x  y
0     a  1
1    bb  2
2   ccc  3
3  dddd  4

In [5]: df.nlargest(3, 'y')
Out[5]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Dask dataframe example

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf['y'] = ddf.x.map(len)

In [6]: ddf.nlargest(3, 'y').compute()
Out[6]: 
      x  y
3  dddd  4
2   ccc  3
1    bb  2

Alternatively, perhaps this already works on the git master version?

Upvotes: 0
