Reputation:
I have a column that looks like this:
['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']
I'm trying to do two things:
Count the frequency of each item like this:
A = 5
As = 1
a = 1
B = 3
C = 1
E = 1
e = 2
I tried a groupby, but it doesn't give the result I hoped for:
df.groupby(['Priority']).Priority.count()
Then I want to group A together with a and As, and group E with e.
Upvotes: 3
Views: 1068
Reputation: 825
Try something like the following:
import re
from collections import Counter

import pandas as pd

def most_common_words(labels, quantity):
    """
    Split every string in the list into words and count how many
    strings each word appears in.
    Args:
        labels (list): List of strings to split.
        quantity (int): Number of most common words to return.
    Returns:
        df (DataFrame): Words with their number of occurrences, sorted ascending.
    """
    # Split on ; , - space * and newline. A character class is used instead of
    # a capturing group, so the delimiters are not returned as words.
    words = [re.split(r'[;,\- *\n]', i) for i in labels]
    # set(xs) counts a word at most once per string; empty pieces are skipped.
    counter = Counter(x for xs in words for x in set(xs) if x).most_common(quantity)
    df = pd.DataFrame(counter, columns=["Word", "Occurence number"])\
        .sort_values(by="Occurence number", ascending=True)
    return df.reset_index(drop=True)

# data_copy is my own DataFrame (not shown here).
df_most_common_words = most_common_words(data_copy["col"].tolist(), 20)
print(df_most_common_words)
The output:
Word Occurence number
19 Repetition 8946
18 Government 9159
17 SACMEQ: 11502
16 Gross 12993
15 PIAAC: 20874
14 PISA: 21087
13 TIMSS: 21300
12 Africa 21513
11 Enrolment 21939
In your case, you can do something like this:
col_a = ['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']
df = pd.DataFrame(col_a, columns=['col_a'])
df
col_a
0 A
1 B
2 A-E
3 a
4 A;e
5 B;e
6 A;B
7 C A
8 As
df['col_a'] = df['col_a'].str.replace('-',' ').str.replace(';',' ')
df
col_a
0 A
1 B
2 A E
3 a
4 A e
5 B e
6 A B
7 C A
8 As
df_most_common_words = most_common_words(df["col_a"].tolist(), 20)
df_most_common_words
Word Occurence number
0 E 1
1 a 1
2 C 1
3 As 1
4 e 2
5 B 3
6 A 5
Upvotes: 2
Reputation: 153510
You can try this:
import pandas as pd
a = pd.Series(['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As'])
a.str.split(r'\W').explode().value_counts()
Output:
A 5
B 3
e 2
E 1
As 1
C 1
a 1
dtype: int64
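For the second part of the question (grouping A with a and As, and E with e), here is a minimal sketch. It assumes that tokens belong together when they share the same first letter, ignoring case. That rule is my assumption, since the question does not define the grouping; an explicit mapping dict passed to `.map()` would work the same way. The names `tokens`, `groups` and `grouped_counts` are illustrative.

```python
import pandas as pd

a = pd.Series(['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As'])

# Split on any non-word character and put each token on its own row.
tokens = a.str.split(r'\W').explode()

# Assumed grouping rule: the upper-cased first letter of each token,
# so 'A', 'a' and 'As' all map to 'A', and 'E' and 'e' map to 'E'.
groups = tokens.str[0].str.upper()

grouped_counts = groups.value_counts()
print(grouped_counts)
```

With the sample data this gives A = 7, B = 3, E = 3 and C = 1.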
Upvotes: 2
Reputation: 323326
If the - denotes a range (so A-E means A through E), we can expand it with chr and ord, and count with Counter:
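import re
from collections import Counter
l = ['A', 'B', 'A-E', 'a', 'A;e', 'B;e', 'A;B', 'C A', 'As']
# Expand 'X-Y' into every character from X to Y; otherwise split on whitespace or ';'.
seq = [[chr(y) for y in range(ord(x.split('-')[0]), ord(x.split('-')[1]) + 1)] if '-' in x else re.split(r'\s|;', x) for x in l]
out = Counter(x for xs in seq for x in set(xs))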
Out[400]: Counter({'A': 5, 'B': 4, 'D': 1, 'C': 2, 'E': 1, 'a': 1, 'e': 2, 'As': 1})
Or, if the - does not denote a range, we can simply split on it as well:
seq = [re.split(r'\s|;|-', x) for x in l]
Counter(x for xs in seq for x in set(xs))
Out[402]: Counter({'A': 5, 'B': 3, 'E': 1, 'a': 1, 'e': 2, 'C': 1, 'As': 1})
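Note that set(xs) makes the Counter count each token at most once per string. A small sketch of the difference, using made-up data (the list below is hypothetical and not from the question) in which one string repeats a token:

```python
import re
from collections import Counter

# Hypothetical data: 'A;A' contains the same token twice.
l = ['A;A', 'A B', 'B']

seq = [re.split(r'[\s;-]', x) for x in l]

# With set(): number of strings that contain the token.
per_string = Counter(x for xs in seq for x in set(xs))

# Without set(): total number of occurrences of the token.
per_occurrence = Counter(x for xs in seq for x in xs)

print(per_string)
print(per_occurrence)
```

Here A is counted 2 times per string but 3 times per occurrence, so drop the set() if repeated tokens in one cell should each count.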
Upvotes: 1