Christopher
Christopher

Reputation: 627

How to do a word count in MongoDB

Can anyone recommend a way that I can do this Python code as a MongoDB query?

import pandas as pd 
data = pd.read_csv("elonmusk.csv") 
from collections import Counter
Counter(" ".join(data["tweet"]).split()).most_common(100)

I am looking for help to write a MongoDB query that can create a similar output as the Python code shown here.

Analysing all the text of one field and returning the most common words.

enter image description here

I believe MongoDB word cloud link here has a similar solution https://docs.mongodb.com/charts/saas/chart-type-reference/word-cloud/ However I have to write the code in the MongoDB shell.

I was not sure how to apply the following Stackoverflow solution in this link Most frequent word in MongoDB collection

Thanks in advance for any advice.

Upvotes: 1

Views: 635

Answers (1)

Takis
Takis

Reputation: 8705

Query

  • query assumes that the text is on tweet field
  • match to find the words with regex (word contains only alphanumeric characters here)
  • unwind
  • group and count
  • $set to fix structure (you can use project also its the same)
  • sort by word count
  • limit to keep only the 100 most popular words

Playmongo

aggregate(
[{"$match": {"$expr": {"$eq": [{"$type": "$tweet"}, "string"]}}},
 {"$set": 
    {"matches": {"$regexFindAll": {"input": "$tweet", "regex": "\\w+"}}}},
 {"$project": {"tokens": "$matches.match"}},
 {"$unwind": {"path": "$tokens"}},
 {"$group": {"_id": "$tokens", "count": {"$sum": 1}}},
 {"$set": {"word": "$_id", "_id": "$$REMOVE"}},
 {"$sort": {"count": -1}},
 {"$limit": 100}])

Upvotes: 1

Related Questions