Reputation: 175
I have a dataframe with ~7,000,000 rows and a lot of columns.
Each row is a tweet, and I have a column text with the tweet's content.
I created a new column just for hashtags inside text:
df['hashtags'] = df['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
So I have a column called hashtags where each row contains a list, e.g. ['#b747', '#test'].
I would like to count the occurrences of each hashtag, but I have a very large number of rows. What is the most performant way to do it?
Upvotes: 1
Views: 317
Reputation: 615
I tried solutions 1, 2 and 3 from Pierre's answer, as they are the most performant and I wanted one or two lines of code. I also wanted the count of hashtags per tweet, i.e. per row, not a dictionary of all the different hashtags used in the dataset with their frequency counts. An up-vote from me for all the effort and the different methods Pierre suggests.
Unfortunately, solutions 1, 2 and 3 all returned NaNs for the Twitter data I work with. I also tried different variants of tweets_df.Tweets.str.extractall(r'(\#\w+)')[0].value_counts()
suggested here, but none of them worked.
What worked in the end are these two lines:
%timeit df['hashtags'] = df['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
15.2 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['hashtagCount'] = df.hashtags.agg([len])
1.11 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
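As a side note, the per-row count can also be taken with Series.str.len, which works directly on list-valued columns. A minimal sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

# toy frame with a list-valued hashtags column, as produced by str.findall
df = pd.DataFrame({'hashtags': [['#b747', '#test'], [], ['#b747']]})

# .str.len() returns the length of each list, i.e. the hashtag count per tweet
df['hashtagCount'] = df['hashtags'].str.len()
print(df['hashtagCount'].tolist())  # [2, 0, 1]
```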
On the note of performance, this code to extract hashtags uses less time:
# alternative: extract hashtags into a new column with apply (requires: import re)
%timeit df['hashtags'] = df['text'].apply(lambda twt: re.findall(r"(#(?:[^\x00-\x7F]|\w)+)", twt))
3.68 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 0
Reputation: 26211
Here are some different approaches, along with timing, ordered by speed (fastest first):
# setup
import numpy as np
import pandas as pd
from collections import Counter
from functools import reduce

n = 10_000
df = pd.DataFrame({
    'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
})
# 1. using itertools.chain to build an iterator on the elements of the lists
from itertools import chain
%timeit Counter(chain(*df.hashtags))
# 7.35 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2. as per @Psidom comment
%timeit df.hashtags.explode().value_counts()
# 8.06 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3. using Counter constructor, but specifying an iterator, not a list
%timeit Counter(h for hl in df.hashtags for h in hl)
# 10.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 4. iterating explicitly and using Counter().update()
def count5(s):
    c = Counter()
    for hl in s:
        c.update(hl)
    return c
%timeit count5(df.hashtags)
# 12.4 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 5. using functools.reduce with Counter.update()
%timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
# 13.7 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6. as per @EzerK
%timeit Counter(sum(df['hashtags'].values, []))
# 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: the fastest is #1 (Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom's comment) is almost as fast; I would probably go with that. #6 (@EzerK's approach) is very slow for a large df because it builds a new (long) list before passing it as argument to Counter().
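If a pandas object is preferred downstream, the Counter from #1 can be wrapped in a Series and sorted to get a ranked frequency table. A sketch on a tiny frame (data is illustrative):

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({'hashtags': [['#b747', '#test'], ['#b747']]})

# flatten the lists lazily, count, then rank by frequency
counts = pd.Series(Counter(chain(*df.hashtags))).sort_values(ascending=False)
print(counts)  # #b747 -> 2, #test -> 1
```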
Upvotes: 3
Reputation: 3739
You can sum all the lists into one big list and then use collections.Counter:
import pandas as pd
from collections import Counter
df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]
Counter(sum(df['hashtags'].values, []))
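On a toy frame this gives the expected totals, but note that sum on lists copies the growing accumulator at every step (quadratic time). For millions of rows, itertools.chain is a drop-in replacement that flattens lazily; a sketch with the same toy data:

```python
import pandas as pd
from collections import Counter
from itertools import chain

df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]

# chain.from_iterable flattens without materializing one big intermediate list
counts = Counter(chain.from_iterable(df['hashtags']))
print(counts)  # Counter({'#b747': 2, '#test': 2})
```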
Upvotes: 0