user15416151

Splitting and counting the frequency of the elements of a Pandas column

I have a Pandas column that looks like this:

['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']

I'm trying to do two things:

Count the frequency of each item like this:

A = 5
As = 1
a = 1
B = 3
C = 1
E = 1
e = 2

I tried a groupby, but I don't get the result I hoped for:

df.groupby(['Priority']).Priority.count()

Then I want to group A with a and As, and group E with e.

Upvotes: 3

Views: 1068

Answers (3)

Samir Hinojosa

Reputation: 825

Try something like the following:

import re
from collections import Counter

import pandas as pd

def most_common_words(labels, quantity):
    """
    Split every string in a list into words and count how many
    strings each word appears in.
    Args:
        labels (list): List of strings to split.
        quantity (int): Number of most common words to return.
    Returns:
        df (DataFrame): The words with their number of occurrences,
            sorted in ascending order of count.
    """
    # Split on ';', ',', '-', ' ', '*' and newline. A character class
    # (not a capturing group) keeps the separators out of the result.
    words = [re.split(r'[;,\- *\n]', i) for i in labels]
    # set(xs) counts a word at most once per string; empty tokens are skipped.
    counter = Counter(x for xs in words for x in set(xs) if x).most_common(quantity)
    df = pd.DataFrame(counter, columns=["Word", "Occurence number"])
    df = df.sort_values(by="Occurence number", ascending=True)

    return df.reset_index(drop=True)

df_most_common_words = most_common_words(data_copy["col"].tolist(), 20)
print(df_most_common_words)

The output on my own data:

            Word  Occurence number
19    Repetition              8946
18    Government              9159
17       SACMEQ:             11502
16         Gross             12993
15        PIAAC:             20874
14         PISA:             21087
13        TIMSS:             21300
12        Africa             21513
11     Enrolment             21939

In your case, you can do something like this:

col_a = ['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']
df = pd.DataFrame(col_a, columns=['col_a'])
df
    col_a
0   A
1   B
2   A-E
3   a
4   A;e
5   B;e
6   A;B
7   C A
8   As

df['col_a'] = df['col_a'].str.replace('-',' ').str.replace(';',' ')
df
    col_a
0   A
1   B
2   A E
3   a
4   A e
5   B e
6   A B
7   C A
8   As

df_most_common_words = most_common_words(df["col_a"].tolist(), 20)
df_most_common_words    
  Word  Occurence number
0   E   1
1   a   1
2   C   1
3   As  1
4   e   2
5   B   3
6   A   5

Upvotes: 2

Scott Boston

Reputation: 153510

You can try this:

import pandas as pd
a = pd.Series(['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As'])
a.str.split(r'\W').explode().value_counts()

Output:

A     5
B     3
e     2
E     1
As    1
C     1
a     1
dtype: int64
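
This covers the first step only. For the second step (grouping A with a and As, and E with e), one possible sketch is to replace each variant by its group label before counting. It assumes the groups are listed by hand in a dictionary; the names groups, tokens and grouped are hypothetical and do not come from the question.

```python
import pandas as pd

a = pd.Series(['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As'])

# Hypothetical, hand-written mapping: variant -> group label.
# Tokens that are not listed here keep their own label.
groups = {'a': 'A', 'As': 'A', 'e': 'E'}

# One token per row, as in the snippet above.
tokens = a.str.split(r'\W').explode()

# Replace each variant by its group label, then count.
grouped = tokens.replace(groups).value_counts()
print(grouped)
```

With the sample data this should give A = 7, B = 3, E = 3 and C = 1. A dictionary makes the grouping explicit; a rule such as "upper-case the first letter" would be shorter but would also merge tokens you may want to keep apart.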

Upvotes: 2

BENY

Reputation: 323326

If the - denotes a range (so that 'A-E' means A through E), we can expand it with chr and ord and count with Counter:

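import re
from collections import Counter

l = ['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']
# Expand 'A-E' into every letter from A to E; split the other items on whitespace or ';'.
seq = [[chr(y) for y in range(ord(x.split('-')[0]), ord(x.split('-')[1]) + 1)] if '-' in x else re.split(r'\s|;', x) for x in l]
out = Counter(x for xs in seq for x in set(xs))
out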
Out[400]: Counter({'A': 5, 'B': 4, 'D': 1, 'C': 2, 'E': 1, 'a': 1, 'e': 2, 'As': 1})

Or, if the - does not denote a range, we can simply split on it as well:

seq = [re.split(r'\s|;|-', x) for x in l]
Counter(x for xs in seq for x in set(xs))
Out[402]: Counter({'A': 5, 'B': 3, 'E': 1, 'a': 1, 'e': 2, 'C': 1, 'As': 1}) 
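
The one-liner above is dense, so here is a sketch of the same two variants unpacked into a small helper. The names tokens, dash_is_range, as_range and as_separator are hypothetical, and the range branch assumes exactly one - between two single characters.

```python
import re
from collections import Counter


def tokens(item, dash_is_range=True):
    """Split one entry into tokens; optionally expand 'A-E' into A, B, ..., E."""
    if dash_is_range and '-' in item:
        # Assumption: exactly one '-' between two single characters.
        start, end = item.split('-')
        return [chr(c) for c in range(ord(start), ord(end) + 1)]
    # Otherwise treat whitespace, ';' and '-' as plain separators
    # and drop any empty tokens.
    return [t for t in re.split(r'[\s;-]', item) if t]


items = ['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']

# set(...) counts each token at most once per entry, as in the code above.
as_range = Counter(t for x in items for t in set(tokens(x)))
as_separator = Counter(t for x in items for t in set(tokens(x, dash_is_range=False)))

print(as_range)
print(as_separator)
```

The two counters should match Out[400] and Out[402] respectively, up to ordering.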

Upvotes: 1
