botti23

Reputation: 71

Tweets Analysis with PyMongo - Lower case before counting hashtags

I have uploaded some tweets to a MongoDB collection and I would like to use PyMongo to count how many times each hashtag has been used. However, since hashtag strings are compared case-sensitively, I would like to lowercase them first (so that myTag and MyTag are counted together).

I used the following pipeline to get the most used hashtags but I'm not able to apply the lowercase function:

tweets.aggregate([
    {'$project': {'tags': '$entities.hashtags.text', '_id': 0}},
    {'$unwind': '$tags'},
    {'$group': {'_id': '$tags', 'count': {'$sum': 1}}}
])

Here are two example documents (tweets), with the fields I'm not interested in removed:

{'_id': ObjectId('604c805b289d1ef5947e1845'),
 'created_at': 'Fri Mar 12 04:36:10 +0000 2021',
 'display_text_range': [0, 140],
 'entities': {'hashtags': [{'indices': [124, 136], 'text': 'MyTag'}],
              'symbols': [],
              'urls': [],
              'user_mentions': [{'id': 123,
                                 'id_str': '123',
                                 'indices': [3, 14],
                                 'name': 'user_name',
                                 'screen_name': 'user_screen_name'}]},
 'user': {'id': 456,
          'id_str': '456',
          'name': 'Author Name',
          'screen_name': 'Author Screen Name'}},
{'_id': ObjectId('604c805b289d1ef5947e1845'),
 'created_at': 'Fri Mar 12 04:36:10 +0000 2021',
 'display_text_range': [0, 140],
 'entities': {'hashtags': [{'indices': [124, 136], 'text': 'MyTAG'}],
              'symbols': [],
              'urls': [],
              'user_mentions': [{'id': 123,
                                 'id_str': '123',
                                 'indices': [3, 14],
                                 'name': 'user_name',
                                 'screen_name': 'user_screen_name'}]},
 'user': {'id': 456,
          'id_str': '456',
          'name': 'Author Name',
          'screen_name': 'Author Screen Name'}}

In this example I would expect something like:

{'_id': 'mytag',
 'count': 2}

Can someone help me?

Thank you in advance for your help!

Francesca

Upvotes: 1

Views: 75

Answers (1)

varman

Reputation: 8894

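You can use $toLower as the _id expression of the $group stage, so each tag is lowercased before the documents are grouped and counted: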

db.collection.aggregate([
  {
    "$project": {
      "tags": "$entities.hashtags.text",
      "_id": 0
    }
  },
  {
    "$unwind": "$tags"
  },
  {
    "$group": {
      "_id": {
        "$toLower": "$tags"
      },
      "count": {
        "$sum": 1
      }
    }
  }
])

Working Mongo playground
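
Since the question asks about PyMongo, here is a minimal sketch of how the same pipeline might look in Python. Every operator name, including "$toLower", has to be a quoted string there. The helper names `count_hashtags` and `count_hashtags_locally` are hypothetical (they are not part of PyMongo), the `$sort` stage is an optional addition that was not in the pipeline above, and `collection` is assumed to be a PyMongo collection such as the `tweets` object from the question. The pure-Python helper only mirrors the pipeline so the expected counts can be checked without a running server.

```python
from collections import Counter

# Same stages as the shell pipeline above, written as Python dicts.
PIPELINE = [
    {"$project": {"tags": "$entities.hashtags.text", "_id": 0}},
    {"$unwind": "$tags"},
    # Lowercase each tag before grouping, so MyTag and MyTAG share one bucket.
    {"$group": {"_id": {"$toLower": "$tags"}, "count": {"$sum": 1}}},
    # Optional addition (not in the original pipeline): most used tags first.
    {"$sort": {"count": -1}},
]


def count_hashtags(collection):
    """Run PIPELINE on a collection and return {lowercased_tag: count}.

    `collection` is assumed to expose PyMongo's aggregate(pipeline) method.
    """
    return {row["_id"]: row["count"] for row in collection.aggregate(PIPELINE)}


def count_hashtags_locally(docs):
    """Pure-Python mirror of the pipeline, for checking on a few documents."""
    counts = Counter(
        tag["text"].lower()
        for doc in docs
        for tag in doc.get("entities", {}).get("hashtags", [])
    )
    return dict(counts)


if __name__ == "__main__":
    # Trimmed versions of the two sample tweets from the question.
    sample = [
        {"entities": {"hashtags": [{"indices": [124, 136], "text": "MyTag"}]}},
        {"entities": {"hashtags": [{"indices": [124, 136], "text": "MyTAG"}]}},
    ]
    print(count_hashtags_locally(sample))
```

With a live connection, the call would be `count_hashtags(tweets)`. One caveat: the MongoDB documentation states that $toLower only has well-defined behavior for ASCII strings, whereas Python's `str.lower()` also handles non-ASCII letters, so the two helpers can disagree on hashtags containing such characters.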

Upvotes: 1
