Reputation: 864
I have a question related to the df['columnA'].value_counts()
method and a previous post here: Count frequency of values in pandas DataFrame column
Take this example DataFrame:
fake_data = {'columnA': ['XAVY', 'XAVY', 'XAVY', 'XAVY', 'XAVY', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV', 'AXYV']}
df = pd.DataFrame(fake_data, columns = ['columnA'])
df
I am trying to determine the frequency of each letter (X,A,V,Y) at each position in the string in this column.
In this example, position 0 would be 54% A, 46% X, position 3 would be 46% Y, 54% V...and so on.
Upvotes: 2
Views: 1871
Reputation: 1406
Maybe this helps:
new_data = df.columnA.str.split('',n=4, expand=True).drop(0, axis=1)
stats = new_data.apply(pd.Series.value_counts)
stats = stats.apply(lambda x: (x/x.sum())*100).round(2).fillna(0)
print(stats)
Output
1 2 3 4
A 54.54 45.45 0 0
V 0 0 45.45 54.54
X 45.45 54.54 0 0
Y 0 0 54.54 45.45
Upvotes: 2
Reputation: 19947
First convert the string to a list and then expand to columns. Then count values and calculate percentage of occurence for each letter.
(
df.columnA.apply(list)
.apply(pd.Series)
.apply(pd.value_counts)
.apply(lambda x: x.div(x.sum()))
.fillna(0)
)
0 1 2 3
A 0.545455 0.454545 0.000000 0.000000
V 0.000000 0.000000 0.454545 0.545455
X 0.454545 0.545455 0.000000 0.000000
Y 0.000000 0.000000 0.545455 0.454545
Upvotes: 4
Reputation: 7361
This expand the linked post:
ddf = df['columnA'].apply(lambda x : pd.Series(list(x)))
counts = ddf[c].value_counts() / ddf[c].value_counts().sum() for c in ddf.columns]
counts
is a list with a series produced by value_counts()
for each column. So to check position 0, you do counts[0]
. And it will gives:
A 0.545455
X 0.454545
Name: 0, dtype: float64
Upvotes: 2
Reputation: 2405
Something to begin with
>>> df.columnA.str.split('', expand=True).apply(pd.Series.value_counts)
0 1 2 3 4 5
11.0 NaN NaN NaN NaN 11.0
A NaN 6.0 5.0 NaN NaN NaN
V NaN NaN NaN 5.0 6.0 NaN
X NaN 5.0 6.0 NaN NaN NaN
Y NaN NaN NaN 6.0 5.0 NaN
This way columns at 0 and 5th positions are empty, and can be ignored.
Upvotes: 0