Reputation: 5584
Is there a pandas function to transform this data so it show the columns as a,b,c,d,e or whatever is inside the data field and the rows count how many of the letters there are.
import pandas as pd
trans = pd.read_table('output.txt', header=None, index_col=0)
print trans
>>>
1 2 3 4
0
11 a b c NaN
666 a d e NaN
10101 b c d NaN
1010 a b c d
414147 b c NaN NaN
10101 a b d NaN
1242 d e NaN NaN
101 a b c d
411 c d e NaN
444 a b c NaN
instead I want the output to be like this:
a b c d e
0
11 1 1 1 NaN NaN
666 1 NaN NaN 1 1
The function .stack() almost gets it done but in the wrong format.
Upvotes: 5
Views: 1599
Reputation: 64493
You could also use Pandas get_dummies()
pd.get_dummies(df.unstack().dropna()).groupby(level=1).sum()
results in:
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
You could replace the zeros with NaN's in you want to.
Its a bit obscure in one line. df.unstack().dropna()
basically flattens your DataFrame to a series and drops al NaN's. The get_dummies
gives a table of all the occurrences of the letters, but for each level in the unstack DataFrame. The grouping and sum then combine the index to the original shape.
Upvotes: 5
Reputation: 117606
Something like this may be:
>>> st = pd.DataFrame(trans.stack()).reset_index(level=0)
>>> st.columns = ['i','c']
>>> st.pivot_table(rows='i', cols='c', aggfunc=len)
c a b c d e
i
11 1 1 1 NaN NaN
101 1 1 1 1 NaN
411 NaN NaN 1 1 1
444 1 1 1 NaN NaN
666 1 NaN NaN 1 1
1010 1 1 1 1 NaN
1242 NaN NaN NaN 1 1
10101 1 2 1 2 NaN
414147 NaN 1 1 NaN NaN
Upvotes: 2