Reputation: 669
I want to create a count table from row data with entries separated by tab. For example,
Input text file:
L1 A A B A C D
L2 B B D E E
L3 F G
The output I expect is the following table:
Output table file:
A B C D E F G
L1 3 1 1 1 0 0 0
L2 0 2 0 1 2 0 0
L3 0 0 0 0 0 1 1
This is the algorithm I thought about:
Please let me know, if there is a better idea.
Thanks,
AP
Upvotes: 1
Views: 334
Reputation: 30605
A bit of hacky way is to use get_dummies
and groupby
i.e
For reading the text file we can do
df = pd.read_csv('untitled.txt',header=None)
df = df[0].str.split(expand=True).set_index(0)
#This will lead to
# 1 2 3 4 5 6
# 0
# L1 A A B A C D
# L2 B B D E E None
# L3 F G None None None None
Then we can make use of pd.get_dummies()
Option 1:
dummies= pd.get_dummies(df)
dummies.groupby(dummies.columns.str[-1],axis=1).sum()
Option 2: (for quickness use stack and value_counts i.e )
df.stack().groupby(level=0).value_counts().unstack().fillna(0)
A B C D E F G 0 L1 3 1 1 1 0 0 0 L2 0 2 0 1 2 0 0 L3 0 0 0 0 0 1 1
Upvotes: 2
Reputation: 19760
Assuming you have a text file like the one you pasted, you can read it with
df = pandas.read_table('input.txt', delimiter=' *',
header=None, engine='python', index_col=0)
Once you have the DataFrame, you can calculate the counts like this:
result = df.apply(pandas.Series.value_counts, axis=1).fillna(0)
Which results in this:
A B C D E F G
0
L1 3.0 1.0 1.0 1.0 0.0 0.0 0.0
L2 0.0 2.0 0.0 1.0 2.0 0.0 0.0
L3 0.0 0.0 0.0 0.0 0.0 1.0 1.0
Upvotes: 1
Reputation: 862611
I think you can loop by each row and use Counter
for dictionaries and last DataFrame
constructor.
Here read_csv
should failed if not same lengths of values in each row.
import csv
from collections import Counter
with open('a.txt', 'r') as f:
d = {row[0]: Counter([x for x in row[1:]]) for row in csv.reader(f, delimiter='\t')}
print (d)
{'L2': Counter({'B': 2, 'E': 2, 'D': 1}),
'L1': Counter({'A': 3, 'B': 1, 'C': 1, 'D': 1}),
'L3': Counter({'G': 1, 'F': 1})}
df = pd.DataFrame.from_dict(d, orient='index').fillna(0).astype(int).sort_index(axis=1)
print (df)
A B C D E F G
L1 3 1 1 1 0 0 0
L2 0 2 0 1 2 0 0
L3 0 0 0 0 0 1 1
Upvotes: 2