Reputation: 71
I have uploaded some tweets to a MongoDB collection and I would like to use PyMongo to find out how many times each hashtag has been used.
However, since grouping is case-sensitive, I would like to treat hashtags as lowercase (so that myTag and MyTag are counted together).
I used the following pipeline to get the most used hashtags but I'm not able to apply the lowercase function:
tweets.aggregate([
{'$project': {'tags': '$entities.hashtags.text', '_id': 0}},
{'$unwind': '$tags'},
{'$group': {'_id': '$tags', 'count': {'$sum': 1}}}
])
Here is an example of two documents (tweets), with the fields I'm not interested in removed:
{'_id': ObjectId('604c805b289d1ef5947e1845'),
'created_at': 'Fri Mar 12 04:36:10 +0000 2021',
'display_text_range': [0, 140],
'entities': {'hashtags': [{'indices': [124, 136], 'text': 'MyTag'}],
'symbols': [],
'urls': [],
'user_mentions': [{'id': 123,
'id_str': '123',
'indices': [3, 14],
'name': 'user_name',
'screen_name': 'user_screen_name'}]},
'user': {'id': 456,
'id_str': '456',
'name': 'Author Name',
'screen_name': 'Author Screen Name'}},
{'_id': ObjectId('604c805b289d1ef5947e1845'),
'created_at': 'Fri Mar 12 04:36:10 +0000 2021',
'display_text_range': [0, 140],
'entities': {'hashtags': [{'indices': [124, 136], 'text': 'MyTAG'}],
'symbols': [],
'urls': [],
'user_mentions': [{'id': 123,
'id_str': '123',
'indices': [3, 14],
'name': 'user_name',
'screen_name': 'user_screen_name'}]},
'user': {'id': 456,
'id_str': '456',
'name': 'Author Name',
'screen_name': 'Author Screen Name'}}
In this example I would expect something like:
{'_id': 'mytag',
'count': 2}
Can someone help me?
Thank you in advance for your help!
Francesca
Upvotes: 1
Views: 75
Reputation: 8894
You can use $toLower as the $group key, so that the tags are lowercased before they are counted:
db.collection.aggregate([
{
"$project": {
"tags": "$entities.hashtags.text",
"_id": 0
}
},
{
"$unwind": "$tags"
},
{
"$group": {
"_id": {
"$toLower": "$tags"
},
"count": {
"$sum": 1
}
}
}
])
Working Mongo playground
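Since the question asks about PyMongo, here is a minimal sketch of the same pipeline written as Python dicts. It assumes `tweets` is the collection object from the question. The function names `count_hashtags` and `count_hashtags_locally` are hypothetical, and the final `$sort` stage is an optional addition of mine to list the most used hashtags first. The second function is only a client-side cross-check for small samples, not part of the aggregation.

```python
from collections import Counter

# Same stages as the pipeline above, written as plain Python dicts for PyMongo.
PIPELINE = [
    {"$project": {"tags": "$entities.hashtags.text", "_id": 0}},
    {"$unwind": "$tags"},
    # Lowercase each tag before grouping, so 'MyTag' and 'MyTAG' share one key.
    {"$group": {"_id": {"$toLower": "$tags"}, "count": {"$sum": 1}}},
    # Optional addition (not in the pipeline above): most used hashtags first.
    {"$sort": {"count": -1}},
]


def count_hashtags(collection):
    """Run the pipeline on a PyMongo collection and return the result documents."""
    return list(collection.aggregate(PIPELINE))


def count_hashtags_locally(docs):
    """Client-side cross-check of the same counting, for small samples only."""
    counts = Counter()
    for doc in docs:
        for tag in doc.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts
```

With the collection from the question, this would be called as `count_hashtags(tweets)`. Note that `$toLower` only has well-defined behaviour for ASCII strings, so for hashtags with non-ASCII characters the server-side result may differ from Python's `str.lower()`.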
Upvotes: 1