Andrew Smith

Reputation: 571

How to count the number of instances of various strings in a dictionary

I have a large dictionary (10,000+ entries) of reviews. Each key is a ReviewID and each value is the language of that review.

My task is to compute the total number of reviews in each language and then display the totals in a bar plot.

import pandas as pd
import csv
import matplotlib.pyplot as plt
import sys
RevDict = {}
with open('ReviewID.txt','r') as f:
    for line in f:
        a,b = line.split(":")
        RevDict[a] = str(b)

This results in a dictionary that looks like this:

[image: screenshot of the resulting dictionary]

My idea was to convert the dictionary into a DataFrame with the review ID in one column and the language in a second column. I could then iterate through the rows with a counter to get a final count for each language, which could easily be turned into a bar plot.

Unfortunately, I can't figure out how to do this.

I also suspect that the more Pythonic approach would be to count the instances of each string in the dictionary directly, rather than building a DataFrame first. I tried this:

from collections import Counter
Counter(k['b'] for k in data if k.get('b'))

It is throwing the following error:

AttributeError: 'str' object has no attribute 'get'

Upvotes: 0

Views: 154

Answers (2)

7stud

Reputation: 48659

Using collections.Counter:

import collections as coll

data = {
  'A': 'English',
  'B': 'German',
  'C': 'English'
} 

print(coll.Counter(data.values()))

--output:--
Counter({'English': 2, 'German': 1})
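
If the counts are headed for a bar plot, Counter.most_common() may be a convenient next step, since it returns the (value, count) pairs sorted by count. A minimal sketch, reusing the same small sample dictionary (the variable names pairs, langs and lang_counts are only illustrative):

```python
from collections import Counter

data = {"A": "English", "B": "German", "C": "English"}
counts = Counter(data.values())

# most_common() returns (value, count) pairs, highest count first.
pairs = counts.most_common()
print(pairs)  # [('English', 2), ('German', 1)]

# Split the pairs into parallel sequences, e.g. for bar labels and heights.
langs, lang_counts = zip(*pairs)
print(langs)        # ('English', 'German')
print(lang_counts)  # (2, 1)
```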

Using pandas:

import pandas as pd

data = {
    'A': 'fr\n',
    'B': 'de\n',
    'C': 'fr\n',
    'D': 'de\n',
    'E': 'fr\n',
    'F': 'en\n'
}

df = pd.DataFrame(
    {
        'id': list(data.keys()),
        'lang': [val.rstrip() for val in data.values()],
    }
)

print(df)

output (row order follows the dict's iteration order, which was arbitrary before Python 3.7):

  id lang
0  B   de
1  A   fr
2  F   en
3  D   de
4  E   fr
5  C   fr

grouped = df.groupby('lang')
print(grouped.size())

output:

lang
de    2
en    1
fr    3

Response to comment

Plotting:

import collections as coll
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter

data = {
    'A': 'fr\n',
    'B': 'de\n',
    'C': 'fr\n',
    'D': 'de\n',
    'E': 'fr\n',
    'F': 'en\n'
}

counter = coll.Counter(
    [val.rstrip() for val in data.values()]
)

langs, lang_counts = zip(
    *sorted(counter.items(), key=itemgetter(1))
)
total_langs = sum(lang_counts)

bar_heights = np.array(lang_counts, dtype=float) / total_langs
x_coord_left_side_of_bars = np.arange(len(langs))
bar_width = 0.8

plt.bar(
    x_coord_left_side_of_bars,
    bar_heights,
    bar_width,
    align='edge',  #x values are the left edges; matplotlib >= 2.0 centers bars by default
)

plt.xticks(  
    x_coord_left_side_of_bars + (bar_width * 0.5),  #position of tick marks
    langs  #labels for tick marks
)
plt.xlabel('review language')
plt.ylabel('fraction of all reviews')  #bar_heights are fractions of the total, not percentages

#plt.show()  #Can use show() instead of savefig() until everything works correctly
plt.savefig('lang_plot.png')

plot:

[image: bar chart of each review language's share of all reviews]

Upvotes: 2

Martijn Pieters

Reputation: 1125398

In your for k in data loop, each k is a string key (the review ID). Strings have no .get() method, and the variable b from your file-reading loop has no bearing on this loop.
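
To make that concrete, here is a small sketch with made-up review IDs (the sample dictionary is hypothetical and only stands in for your real data):

```python
# Hypothetical sample data standing in for the real dictionary.
data = {"1001": "en", "1002": "de", "1003": "en"}

# Iterating over a dict yields its keys, which here are plain strings.
keys = [k for k in data]
print(keys)                     # ['1001', '1002', '1003']
print(type(keys[0]).__name__)   # str

# Strings have no .get() method, hence the AttributeError.
print(hasattr(keys[0], "get"))  # False

# The languages are the dict's values.
langs = list(data.values())
print(langs)                    # ['en', 'de', 'en']
```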

If you wanted to count the values, just pass the values of the dictionary straight to the Counter:

Counter(data.values())

You probably want to strip the trailing newline from each value when you build the dictionary:

for line in f:
    review_id, lang = line.split(":")
    RevDict[review_id] = lang.strip()
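
Putting the two pieces together, an end-to-end sketch might look like the following. It assumes every non-blank line has the form review_id:lang; the count_languages helper and the sample contents are hypothetical, and io.StringIO stands in for the real ReviewID.txt so the example runs on its own:

```python
import io
from collections import Counter

def count_languages(lines):
    """Count reviews per language from 'review_id:lang' lines (hypothetical helper)."""
    rev_dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, in case the language field contains one.
        review_id, lang = line.split(":", 1)
        rev_dict[review_id] = lang.strip()
    return Counter(rev_dict.values())

# Stand-in for open('ReviewID.txt'); the contents are made up for illustration.
sample = io.StringIO("1001:en\n1002:de\n1003:en\n")
counts = count_languages(sample)
print(counts)  # Counter({'en': 2, 'de': 1})
```

With the real file, the same function could be called as count_languages(f) inside the with open(...) block.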

Upvotes: 1
