Arseniy Krupenin
Arseniy Krupenin

Reputation: 3880

How to get unique from dataframe using pandas?

I have df

2016-06-21 06:25:09 [email protected] GET HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 application/json    2130    https://edge-chat.facebook.com/pull?channel=p_100006170407238&seq=27&clientid=1d67ca6e&profile=mobile&partition=-2&sticky_token=185&msgs_recv=27&qp=y&cb=1830997782&state=active&sticky_pool=frc3c09_chat-proxy&uid=100006170407238&viewer_uid=100006170407238&m_sess=&__dyn=1Z3p5wnE-4UpwDF3GAgy78qzoC6Erz8B0GxG9xu3Z0QwFzohxO3O2G2a1mwYxm48sxadwpVEy1qK78gwUx6&__req=79&__ajax__=AYlbtcBwGC2suZLI-J88V0PWa58vtQeG3YlQLydFRsAl6UwLSjsSpD7peu8mGl6NsHvd2zxfDcB6A0-XunBugUsYZ1lMYmUu97R43iV7XSfpyg&__user=100006170407238
2016-06-22 06:25:20 [email protected] POST HTTP/1.1   Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 application/x-javascript    20248   https://m.facebook.com/stories.php?aftercursor=MTQ2NjY2MzEwNToxNDY2NjYzMTA1Ojg6NzM0ODg0MDExMjAyNDY1MzA5NToxNDY2NjYyNzk1OjA%3D&tab=h_nor&__m_log_async__=1
2016-06-23 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-23 06:25:25 [email protected] GET HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 text/html   1105    https://m.facebook.com/xti.php?xt=2.qid.6299270070554694533%3Amf_story_key.343726573953754118%3Aei.AI%40ecf11fb3faf9c0b1f73ce2a74bc9f228
2016-06-24 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443

I need to get unique date to every ID (only year, month and date). Desired output:

[email protected] - 2016-06-21, 2016-06-22, 2016-06-23
[email protected] - 2016-06-24, 2016-06-25

How can I get this date?

Upvotes: 3

Views: 192

Answers (4)

mkos
mkos

Reputation: 428

Let's read your sample data in:

import pandas as pd
import StringIO

df = pd.read_table(StringIO.StringIO("""2016-06-21 06:25:09 [email protected] GET HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 application/json    2130    https://edge-chat.facebook.com/pull?channel=p_100006170407238&seq=27&clientid=1d67ca6e&profile=mobile&partition=-2&sticky_token=185&msgs_recv=27&qp=y&cb=1830997782&state=active&sticky_pool=frc3c09_chat-proxy&uid=100006170407238&viewer_uid=100006170407238&m_sess=&__dyn=1Z3p5wnE-4UpwDF3GAgy78qzoC6Erz8B0GxG9xu3Z0QwFzohxO3O2G2a1mwYxm48sxadwpVEy1qK78gwUx6&__req=79&__ajax__=AYlbtcBwGC2suZLI-J88V0PWa58vtQeG3YlQLydFRsAl6UwLSjsSpD7peu8mGl6NsHvd2zxfDcB6A0-XunBugUsYZ1lMYmUu97R43iV7XSfpyg&__user=100006170407238
2016-06-22 06:25:20 [email protected] POST HTTP/1.1   Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 application/x-javascript    20248   https://m.facebook.com/stories.php?aftercursor=MTQ2NjY2MzEwNToxNDY2NjYzMTA1Ojg6NzM0ODg0MDExMjAyNDY1MzA5NToxNDY2NjYyNzk1OjA%3D&tab=h_nor&__m_log_async__=1
2016-06-23 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-23 06:25:25 [email protected] GET HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 text/html   1105    https://m.facebook.com/xti.php?xt=2.qid.6299270070554694533%3Amf_story_key.343726573953754118%3Aei.AI%40ecf11fb3faf9c0b1f73ce2a74bc9f228
2016-06-24 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1    Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53   200 -   0   scontent.xx.fbcdn.net:443
"""), delim_whitespace=True, header=None)

You are interested in first (index: 0) column, which is date and third (index:2) which is email addr. Purely for visibility reasons, let's isolate them in new data frame:

df2 = df[[0, 2]]

which is now:

            0                2
0  2016-06-21  [email protected]
1  2016-06-22  [email protected]
2  2016-06-23  [email protected]
3  2016-06-23  [email protected]
4  2016-06-24  [email protected]
5  2016-06-25  [email protected]
6  2016-06-25  [email protected]

we now need to group them and aggregate with custom function which will turn aggregated dates into list (like your desired output):

df2.groupby(2).agg(lambda x: x.unique().tolist()).reset_index()

reset_index() fixes indexing so w get following data frame:

                 2                                     0
0  [email protected]              [2016-06-24, 2016-06-25]
1  [email protected]  [2016-06-21, 2016-06-22, 2016-06-23]

Upvotes: 1

ysearka
ysearka

Reputation: 3855

You can first extract the info you need from your dates:

df['filtered date'] = [w[:10] for w in df['date']]

Then you use a `drop duplicates':

output = df[['id','filtered date']].drop_duplicates()

You can then reorder your data frame for clarity:

output.sort_values(by['id','filtered date'],inplace = True)

You'll finally get this kind of output:

    id               filtered date
0   [email protected]  2016-06-24
1   [email protected]  2016-06-25
3   [email protected]  2016-06-21
4   [email protected]  2016-06-22
5   [email protected]  2016-06-23

Upvotes: 2

Alicia Garcia-Raboso
Alicia Garcia-Raboso

Reputation: 13913

Here's a one-liner (supposing date and ID as the names of the relevant columns)

df.groupby('ID').apply(lambda x: (x['date'].str[:10]).unique())

and its output

ID
[email protected]                [2016-06-24, 2016-06-25]
[email protected]    [2016-06-21, 2016-06-22, 2016-06-23]
dtype: object

Upvotes: 1

jbndlr
jbndlr

Reputation: 5210

Pandas provides the function groupby for DataFrames, which should be suitable for what you require.

# Generate dataframe with random values
mail  = ['[email protected]', '[email protected]', '[email protected]']
stime = datetime.strptime('2016-07-01 00:00:00', '%Y-%m-%d %H:%M:%S')
etime = datetime.strptime('2016-07-30 00:00:00', '%Y-%m-%d %H:%M:%S')
tdelta = etime - stime
tdiff = tdelta.days * 24 * 60 * 60 + tdelta.seconds

df = pd.DataFrame({
    'mail': [choice(mail) for _ in range(10)],
    'time':[stime + timedelta(seconds=randrange(tdiff)) for _ in range(10)]
})

# Group dataframe by column 'mail' and apply the lambda expression to
# transform the grouped set of values into unique time values.
r = df.groupby(by='mail').apply(lambda x: set(x['time'].values))

Then, you should be able to work with the result:

print(r)

mail
[email protected]    {2016-07-24T16:42:12.000000000, 2016-07-07T15:...
[email protected]      {2016-07-13T18:53:07.000000000, 2016-07-04T06:...
[email protected]       {2016-07-10T07:37:19.000000000, 2016-07-09T07:...
dtype: object

Upvotes: 1

Related Questions