user3440320
user3440320

Reputation: 43

Python group by 2 columns, output multiple columns

I have a tab-delimited file with movie genre and year in 2 columns:

Comedy  2013
Comedy  2014
Drama   2012
Mystery 2011
Comedy  2013
Comedy  2013
Comedy  2014
Comedy  2013
News    2012
Sport   2012
Sci-Fi  2013
Comedy  2014
Family  2013
Comedy  2013
Drama   2013
Biography   2013

I want to group the genres together by year and print out in the following format (does not have to be in alphabetical order):

Year    2011    2012    2013    2014
Biography   0   0   1   0
Comedy  0   0   5   3
Drama   0   1   1   0   
Family      0   0   1   0
Mystery 1   0   0   0
News    0   1   0   0
Sci-Fi  0   0   1   0
Sport   0   1   0   0

How should I approach it? At the moment I'm creating my output through MS Excel, but I would like to do it through Python.

Upvotes: 0

Views: 184

Answers (2)

Falko
Falko

Reputation: 17907

If you don't like to use pandas, you can do it as follows:

from collections import Counter

# load file
with open('tab.txt') as f:
    lines = f.read().split('\n')

# replace separating whitespace with exactly one space
lines = [' '.join(l.split()) for l in lines]

# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))

# count genre-year combinations
C = Counter(lines)

# print table
print "Year".ljust(10),
for y in years:
    print y.rjust(6),
print
for g in genres:
    print g.ljust(10),
    for y in years:
        print `C[g + ' ' + y]`.rjust(6),
    print

The most interesting function is probably Counter, which counts the number of occurrences of each element. To make sure that the length of the separating whitespace does not influence the counting, I replace it with a single space beforehand.

Upvotes: 1

Marius
Marius

Reputation: 60180

The easiest way do to this is using the pandas library, which provides lots of way of interacting with tables of data:

df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc=len, fill_value=0)

Output:

year       2011  2012  2013  2014
genre                            
Biography     0     0     1     0
Comedy        0     0     5     3
Drama         0     1     1     0
Family        0     0     1     0
Mystery       1     0     0     0
News          0     1     0     0
Sci-Fi        0     0     1     0
Sport         0     1     0     0

If you're only just starting with Python, you might find trying to learn pandas is a bit too much on top of learning the language, but once you have some Python knowledge, pandas provides very intuitive ways to interact with data.

Upvotes: 1

Related Questions