Arun

Reputation: 669

Dynamic creation of count table in pandas

I want to create a count table from row data with entries separated by tab. For example,

Input text file:

   L1 A  A  B  A  C  D
   L2 B  B  D  E  E
   L3 F  G

The output I expect is the following table:

Output table file:

   A B C D E F G
L1 3 1 1 1 0 0 0
L2 0 2 0 1 2 0 0
L3 0 0 0 0 0 1 1

This is the algorithm I thought about:

  1. Create a pandas DataFrame with column headers ['A','B', ..., 'G']
  2. Open the text file and read it line by line
  3. Split each line on tab
  4. For each entry that matches a column header, increment that column's count (see the sketch below)

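A rough, untested sketch of that plan (the file name is just a placeholder, and I am assuming the file really is tab-separated):

import pandas as pd

# count entries per line into a dict of dicts, then build the DataFrame once
counts = {}
with open('input.txt') as f:               # placeholder file name
    for line in f:
        parts = line.rstrip('\n').split('\t')
        label, entries = parts[0], parts[1:]
        row = counts.setdefault(label, {})
        for entry in entries:
            if entry:                      # skip empty fields
                row[entry] = row.get(entry, 0) + 1

df = (pd.DataFrame.from_dict(counts, orient='index')
        .fillna(0).astype(int).sort_index(axis=1))
print(df)
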
Please let me know if there is a better idea.

Thanks,

AP

Upvotes: 1

Views: 334

Answers (3)

Bharath M Shetty

Reputation: 30605

A slightly hacky way is to use get_dummies and groupby, i.e.

For reading the text file we can do

import pandas as pd

df = pd.read_csv('untitled.txt', header=None)
df = df[0].str.split(expand=True).set_index(0)

#This will lead to
#    1  2     3     4     5     6
# 0                               
# L1  A  A     B     A     C     D
# L2  B  B     D     E     E  None
# L3  F  G  None  None  None  None

Then we can make use of pd.get_dummies()

Option 1:

dummies = pd.get_dummies(df)
dummies.groupby(dummies.columns.str[-1], axis=1).sum()
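
This works because get_dummies prefixes each indicator column with the original column name; for the sample data the column names should look something like this:

print(dummies.columns.tolist())
# ['1_A', '1_B', '1_F', '2_A', '2_B', '2_G', '3_B', '3_D', '4_A', '4_E', '5_C', '5_E', '6_D']

Taking the last character of each name therefore groups the indicators for the same letter back together (this relies on the values being single characters).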

Option 2: (for speed, use stack and value_counts, i.e.)

df.stack().groupby(level=0).value_counts().unstack().fillna(0)
    A  B  C  D  E  F  G
0                      
L1  3  1  1  1  0  0  0
L2  0  2  0  1  2  0  0
L3  0  0  0  0  0  1  1

Upvotes: 2

chthonicdaemon

Reputation: 19760

Assuming you have a text file like the one you pasted, you can read it with

import pandas

df = pandas.read_table('input.txt', delimiter=' *',
                       header=None, engine='python', index_col=0)

Once you have the DataFrame, you can calculate the counts like this:

result = df.apply(pandas.Series.value_counts, axis=1).fillna(0)

Which results in this:

      A    B    C    D    E    F    G
0                                    
L1  3.0  1.0  1.0  1.0  0.0  0.0  0.0
L2  0.0  2.0  0.0  1.0  2.0  0.0  0.0
L3  0.0  0.0  0.0  0.0  0.0  1.0  1.0
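
The counts come back as floats because of the NaNs that value_counts leaves before the fillna. If the integer table from the question is wanted, a cast at the end should do it:

result = df.apply(pandas.Series.value_counts, axis=1).fillna(0).astype(int)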

Upvotes: 1

jezrael

Reputation: 862611

I think you can loop over each row, build a Counter per row into a dictionary, and finally use the DataFrame constructor.

A plain read_csv would fail here, because the rows do not all have the same number of values.

import csv
from collections import Counter

import pandas as pd

with open('a.txt', 'r') as f:
    d = {row[0]: Counter(row[1:]) for row in csv.reader(f, delimiter='\t')}

print (d)
{'L2': Counter({'B': 2, 'E': 2, 'D': 1}), 
 'L1': Counter({'A': 3, 'B': 1, 'C': 1, 'D': 1}),
 'L3': Counter({'G': 1, 'F': 1})}

df = pd.DataFrame.from_dict(d, orient='index').fillna(0).astype(int).sort_index(axis=1)
print (df)
    A  B  C  D  E  F  G
L1  3  1  1  1  0  0  0
L2  0  2  0  1  2  0  0
L3  0  0  0  0  0  1  1

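If the columns should always cover A to G even when a letter never appears in the file, a reindex should handle that (the column list here is taken from the expected output in the question):

df = df.reindex(columns=list('ABCDEFG'), fill_value=0)
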
Upvotes: 2
