tktktk0711

Reputation: 1694

Python pandas: group a DataFrame by a column (such as name) and get the values of some columns in each group

There is a DataFrame called df, as follows:

  name   id    age             text 
   a      1     1    very good, and I like him
   b      2     2    I play basketball with his brother
   c      3     3    I hope to get a offer
   d      4     4    everything goes well, I think
   a      1     1    I will visit china
   b      2     2    no one can understand me, I will solve it
   c      3     3    I like followers
   d      4     4    maybe I will be good
   a      1     1    I should work hard to finish my research
   b      2     2    water is the source of earth, I agree it
   c      3     3    I hope you can keep in touch with me
   d      4     4    My baby is very cute, I like him

There are four names: a, b, c, and d, and each name has an id, an age, and a text. The id and age are the same within each name group, but the text differs from row to row. Each name has three rows here (this is just an example; the real data is much larger).

I want to get the id and age for each name group. In addition, I want to compute the index of a character in every text of each group using the function extract_text(text). For example, for the name 'a' I want: age: 1, id: 1, and the index of 'I' in its three rows (illustrative values, not the real ones): 20, 0, 0.

I have tried the following:

 import  pandas as pd

 def extract_text(text):
     index_n = None
     text_len = len(text)
     for i in range(0, text_len, 1):
         if text[i] == 'I':
            index_n = i
     return index_n



 df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd',     
                            'a', 'b', 'c', 'd'],
               'id': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
               'age':[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
               'text':['very good, and I like him', 
                       'I play basketball with his brother',
                       'I hope to get a offer', 
                       'everything goes well, I think',
                       'I will visit china', 
                       'no one can understand me, I will solve it',
                       'I like followers', 'maybe I will be good',
                       'I should work hard to finish my research',                 
                       'water is the source of earth, I agree it',
                       'I hope you can keep in touch with me', 
                       'My baby is very cute, I like him']})


 id_num = df.groupby('name')['id'].value[0]
 id_num = df.groupby('age')['id'].value[0]
 index_num = df.groupby('age')['text'].apply(extract_text)

But there is an error:

Traceback (most recent call last):
  File "bot_test_new.py", line 25, in <module>
    id_num = df.groupby('name')['id'].value[0]
AttributeError: 'SeriesGroupBy' object has no attribute 'value'

Please give me a hand. Thanks in advance!

Upvotes: 2

Views: 3506

Answers (2)

jezrael

Reputation: 862511

I think you can use str.find:

print (df.groupby('age')['text'].apply(lambda x: x.str.find('I').tolist()))
age
1     [15, 0, 0]
2    [0, 26, 30]
3      [0, 0, 0]
4    [22, 6, 22]
Name: text, dtype: object

If you need id_num, use iloc to take the first value in each group:

id_num = df.groupby('name')['id'].apply(lambda x: x.iloc[0])
print (id_num)
name
a    1
b    2
c    3
d    4
Name: id, dtype: int64
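
Since the question says id and age are constant within each name group, both columns can also be collected in one step. The following is a minimal sketch, assuming that constancy holds; it uses GroupBy.first() rather than the iloc lambda above, and the shortened DataFrame and the name per_name are only for illustration:

```python
import pandas as pd

# Shortened stand-in for the question's df (text column omitted here).
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'] * 3,
    'id':   [1, 2, 3, 4] * 3,
    'age':  [1, 2, 3, 4] * 3,
})

# first() keeps the first non-null value of each selected column per group.
per_name = df.groupby('name')[['id', 'age']].first()
print(per_name)
```

The result is a DataFrame indexed by name, so per_name.loc['a', 'id'] gives the id for group a.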

But it looks like you may not need groupby at all and can compute the position row by row:

df['position'] = df['text'].str.find('I')

print (df)
    age  id name                                       text  position
0     1   1    a                  very good, and I like him        15
1     2   2    b         I play basketball with his brother         0
2     3   3    c                      I hope to get a offer         0
3     4   4    d              everything goes well, I think        22
4     1   1    a                         I will visit china         0
5     2   2    b  no one can understand me, I will solve it        26
6     3   3    c                           I like followers         0
7     4   4    d                       maybe I will be good         6
8     1   1    a   I should work hard to finish my research         0
9     2   2    b   water is the source of earth, I agree it        30
10    3   3    c       I hope you can keep in touch with me         0
11    4   4    d           My baby is very cute, I like him        22
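
If one row per name is still wanted, the per-row position column can be rolled up afterwards. This is a sketch under two assumptions: a pandas version that supports named aggregation (0.25 or later), and id and age being constant within each name. The four-row DataFrame and the name summary are illustrative only:

```python
import pandas as pd

# A four-row subset of the question's data, enough to show two groups.
df = pd.DataFrame({
    'name': ['a', 'b', 'a', 'b'],
    'id':   [1, 2, 1, 2],
    'age':  [1, 2, 1, 2],
    'text': ['very good, and I like him',
             'I play basketball with his brother',
             'I will visit china',
             'no one can understand me, I will solve it'],
})

# Position of the first 'I' in each row (-1 if there is none).
df['position'] = df['text'].str.find('I')

# One row per name: first id, first age, and the list of positions.
summary = df.groupby('name').agg(
    id=('id', 'first'),
    age=('age', 'first'),
    positions=('position', list),
)
print(summary)
```

Here summary.loc['a', 'positions'] holds the positions for every text in group a, in their original row order.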

Upvotes: 1

Javier

Reputation: 420

I'll elaborate a bit more than in the comment. The problem is that extract_text can only handle an individual string. However, when you groupby and then apply, the function receives a Series containing all the strings in the group, not one string at a time.

There are two solutions, the first is the one I indicated (sending individual strings):

index_num = df.groupby('age')['text'].apply(lambda x: [extract_text(_) for _ in x]) 

The other is changing extract_text so it can handle the list of strings:

def extract_text(list_texts):
    list_index = []
    for text in list_texts:
        index_n = None
        text_len = len(text)
        for i in range(0, text_len, 1):
            if text[i] == 'I':
                index_n = i
        list_index.append(index_n)
    return list_index

And then continue with:

index_num = df.groupby('age')['text'].apply(extract_text)

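Moreover, you can use text.find("I") instead of your loop inside extract_text. Something like this: def extract_text(list_texts): return [text.find("I") for text in list_texts]. Note that the two are not strictly identical: find returns the first occurrence and -1 when the character is absent, while your loop returns the last occurrence and None when it is absent (text.rfind("I") gives the last occurrence). For the sample data each text contains exactly one 'I', so the results coincide.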
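
To see how str.find relates to the original loop, here is a small comparison on plain strings. It is only a sketch: last_index_loop is a hypothetical name for a copy of the question's loop, and the second and third sample strings are made up for illustration.

```python
def last_index_loop(text):
    # Same logic as the question's extract_text: no break, so the
    # last matching index wins; None if 'I' never appears.
    index_n = None
    for i, ch in enumerate(text):
        if ch == 'I':
            index_n = i
    return index_n

samples = [
    'very good, and I like him',  # one 'I'
    'I think I can',              # two 'I's (made-up example)
    'no capital here',            # no 'I' (made-up example)
]

for s in samples:
    # Columns: loop result, first occurrence, last occurrence.
    print(repr(s), last_index_loop(s), s.find('I'), s.rfind('I'))
```

With a single 'I' all three agree; with several, only rfind matches the loop; with none, the loop gives None while find and rfind give -1.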

Upvotes: 1
