OllyAginous

Reputation: 51

Frequency Distribution Comparison Python

I'm using python and nltk to study some texts and I want to compare the frequency distributions of parts of speech across the different texts.

I can do it for one text:

from nltk import FreqDist, pos_tag, word_tokenize

with open('/Users/X.txt') as f:
    X_tagged = pos_tag(word_tokenize(f.read()))

X_fd = FreqDist([tag for word, tag in X_tagged])
X_fd.plot(cumulative=True, title='Part of Speech Distribution in Corpus X')

I've tried to add another text, but without much luck. I've seen the conditional frequency distribution example for comparing the counts of three words across several texts, but instead I'd like each line to represent one of four texts, with the y-axis showing the counts and the x-axis showing the parts of speech. How do I plot texts Y and Z in the same graph as X?

Upvotes: 2

Views: 3521

Answers (3)

bmaz

Reputation: 145

Here is an example using matplotlib:

from matplotlib import pyplot as plt
from nltk import FreqDist, word_tokenize
import numpy as np

# word_tokenize is used here; any tokenizer that returns a list of tokens will do
dist = {}
dist["win"] = FreqDist(word_tokenize("first text"))
dist["draw"] = FreqDist(word_tokenize("second text"))
dist["lose"] = FreqDist(word_tokenize("third text"))
dist["mixed"] = FreqDist(word_tokenize("fourth text"))

# sorted list of 50 most common terms in one of the texts
# (too many terms would be illegible in the graph)
most_common = [item for item, _ in dist["mixed"].most_common(50)] 

colors = ["green", "blue", "red", "turquoise"]

# loop over the dictionary keys to plot each distribution
for i, label in enumerate(dist):
    frequency = [dist[label][term] for term in most_common]
    color = colors[i]
    plt.plot(frequency, color=color, label=label)
plt.gca().grid(True)
plt.xticks(np.arange(0, len(most_common), 1), most_common, rotation=90)
plt.xlabel("Most common terms")
plt.ylabel("Frequency")
plt.legend(loc="upper right")
plt.show()

Upvotes: 0

OllyAginous

Reputation: 51

I figured this out, if anyone's interested. Build a separate frequency distribution for each text, then collect the results into a dictionary keyed by tag, where each value is a tuple holding that tag's count in each of the FreqDists. Then plot the values for each FreqDist and set the keys as the x-axis labels, in the same order you pulled them out.

# 'win', 'draw', 'lose' and 'mixed' are already POS tagged (lists of tuples such as ('the', 'DT'))
win = FreqDist([tag for word, tag in win])
draw = FreqDist([tag for word, tag in draw])
lose = FreqDist([tag for word, tag in lose])
mixed = FreqDist([tag for word, tag in mixed])

POS = [item for item in win] # keys taken from 'win', used as the shared x values

results = {}
for key in POS:
    results[key] = tuple([win[key], draw[key], lose[key], mixed[key]]) # one key, tuple of values for each FreqDist (in order)

win_counts = [results[item][0] for item in results]
draw_counts = [results[item][1] for item in results]
lose_counts = [results[item][2] for item in results]
mixed_counts = [results[item][3] for item in results]

display = [item for item in results] # over-cautious, same as POS above

# requires 'from matplotlib import pyplot as plt' and 'import numpy as np'
plt.plot(win_counts, color='green', label="win")
plt.plot(draw_counts, color='blue', label="draw")
plt.plot(lose_counts, color='red', label="lose")
plt.plot(mixed_counts, color='turquoise', label="mixed")
plt.gca().grid(True)
plt.xticks(np.arange(0, len(display), 1), display, rotation=45) # will put keys as x values
plt.xlabel("Parts of Speech")
plt.ylabel("Counts per 10,000 tweets")
plt.suptitle("Part of Speech Distribution across Pre-Win, Pre-Loss and Pre-Draw Corpora")
plt.legend(loc="upper right")
plt.show()
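
One caveat with the code above: POS is taken from win alone, so a tag that occurs only in draw, lose or mixed never reaches the plot. Below is a minimal sketch of aligning on the union of tags instead, assuming each distribution behaves like a collections.Counter (NLTK 3's FreqDist subclasses it). The helper name align_counts and the toy counts are made up for illustration.

```python
from collections import Counter


def align_counts(dists):
    """Return (tags, series) where tags is the sorted union of all keys and
    series maps each label to its counts listed in the same tag order."""
    tags = sorted(set().union(*dists.values()))
    # Counter lookups return 0 for missing keys, so absent tags plot as zero
    series = {label: [dist[tag] for tag in tags] for label, dist in dists.items()}
    return tags, series


# Invented counts standing in for the real FreqDist objects
dists = {
    "win": Counter({"NN": 5, "DT": 3}),
    "draw": Counter({"NN": 2, "VB": 4}),
}
tags, series = align_counts(dists)
```

tags can then be passed to plt.xticks and each list in series to plt.plot, the same way display and the *_counts lists are used above.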

Upvotes: 3

b3000

Reputation: 1677

The FreqDist.plot() method is only a convenience method.

You would need to write the plotting logic yourself (using matplotlib) to include multiple frequency distributions in one plot.

The source code of FreqDist's plotting function might be a good starting point. matplotlib also has a good tutorial and beginner's guide.
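
As a rough sketch of what such hand-written plotting logic could look like (this is not NLTK's own implementation): the hypothetical helper plot_distributions below draws one line per text on a shared axis, orders the x-axis by total count across all texts, and keeps only the top_n keys. It assumes Counter-like inputs, which FreqDist objects are in NLTK 3; the counts are invented.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_distributions(dists, top_n=50, ax=None):
    """Plot one line per labelled Counter-like distribution on the same axes."""
    if ax is None:
        _, ax = plt.subplots()

    # Pool the counts so the x-axis order reflects all texts, not just one
    total = Counter()
    for dist in dists.values():
        total.update(dist)
    keys = [key for key, _ in total.most_common(top_n)]

    positions = list(range(len(keys)))
    for label, dist in dists.items():
        # Missing keys count as 0, so every line has the same length
        ax.plot(positions, [dist[key] for key in keys], label=label)

    ax.set_xticks(positions)
    ax.set_xticklabels(keys, rotation=90)
    ax.set_xlabel("Part of speech")
    ax.set_ylabel("Count")
    ax.grid(True)
    ax.legend(loc="upper right")
    return ax


# Invented counts standing in for FreqDist objects
ax = plot_distributions({
    "X": Counter({"NN": 5, "DT": 3}),
    "Y": Counter({"NN": 2, "VB": 4}),
})
```

Call plt.show() afterwards to display the figure. Passing an existing ax lets you place the comparison inside a larger figure.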

Upvotes: 3
