Reputation: 25
I have a data frame df
consisting of two columns (a word and the meaning/definition of that word). I want to use a collections.Counter
object for each definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.
The traditional approach would be to iterate over the data frame using the iterrows()
method and do the computation row by row.
Sample output
Word     Meaning                              Word Freq
Array    collection of homogeneous datatype   {'collection': 1, 'of': 1, ...}
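For reference, a minimal sketch of that iterrows() approach (assuming the columns are named 'Word' and 'Meaning' as in the sample output) could look like this:

from collections import Counter

import pandas as pd

# small illustrative frame matching the sample output above
df = pd.DataFrame({'Word': ['Array'],
                   'Meaning': ['collection of homogeneous datatype']})

# row-by-row counting with iterrows()
df['Word Freq'] = [Counter(row['Meaning'].split()) for _, row in df.iterrows()]
print(df)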
Upvotes: 1
Views: 117
Reputation: 294248
I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter
and @TedPetrou's answer.
Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

# 100,000 random 5-letter "words", grouped into 10,000 definitions of 10 words each
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
definitions.head()
0 hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1 iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2 ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3 uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4 npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object
Timing

Counter is on the order of 1,000 times faster than the fastest alternative I could come up with.
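The timing code itself isn't reproduced here; as one illustrative way to run such a comparison in IPython/Jupyter (the pandas-based baseline below is just one possible alternative, not necessarily the one originally benchmarked):

from collections import Counter

def with_counter(s):
    # one Counter over every word in every definition
    return Counter(s.str.cat(sep=' ').split())

def with_pandas(s):
    # a pandas-only alternative: split into columns, stack, then count
    return s.str.split(expand=True).stack().value_counts()

%timeit with_counter(definitions)
%timeit with_pandas(definitions)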
Upvotes: 0
Reputation: 61947
I would take advantage of the pandas str
accessor methods and do this:
from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())
Some test data:

import pandas as pd

df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})
print(df)
definition word
0 this is a definition some
1 another definition words
2 one final definition yes
Then concatenate, split on spaces, and pass the result to Counter:
Counter(df.definition.str.cat(sep=' ').split())
Counter({'a': 1,
         'another': 1,
         'definition': 3,
         'final': 1,
         'is': 1,
         'one': 1,
         'this': 1})
Upvotes: 2
Reputation: 36608
Assuming that df
has two columns, 'word'
and 'definition'
, you can simply use the .map
method on the definition
series to build a Counter
from each definition after splitting on spaces, then sum the results.
from collections import Counter

# build one Counter per definition, then add them all together
def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
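If the per-row 'Word Freq' column from the question's sample output is also wanted, the intermediate def_counts series can simply be attached as a new column (the column name here is only illustrative):

df['word_freq'] = def_counts   # one Counter per row
print(df)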
Upvotes: 0