Reputation: 93
I have data in the following form, stored in a single column of a CSV file.
['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']
['fahb', 'ahba', 'hbac', 'bacc']
['hchc', 'chcb', 'hcbh']
['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']
['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']
I need to find the most frequently occurring four-character string in this data. Please suggest a way to do this. (This is only an example; the original data is large.)
Upvotes: 1
Views: 442
Reputation: 863176
You can try apply(pd.Series) to create a DataFrame, then stack and value_counts. Finally, you can keep only the top values with head or [:5]:
print(df)
a
0 [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1 [fahb, ahba, hbac, bacc]
2 [hchc, chcb, hcbh]
3 [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4 [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]
print(df.a.apply(pd.Series).stack().value_counts()[:1])
hhhh 2
dtype: int64
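On newer pandas versions (0.25 and later), Series.explode gives the same flattening without building a wide intermediate DataFrame. A minimal sketch, assuming the column is named a as above and already holds Python lists; the sample frame is rebuilt here only so the snippet runs on its own:

```python
import pandas as pd

# Sample data from the question; the column name "a" follows this answer.
df = pd.DataFrame({"a": [
    ['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh'],
    ['fahb', 'ahba', 'hbac', 'bacc'],
    ['hchc', 'chcb', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc'],
    ['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd'],
]})

# explode() puts every list element on its own row
# (the original row label is repeated in the index).
counts = df["a"].explode().value_counts()

# Keep only the most frequent string.
top = counts.head(1)
print(top)
```

This swaps explode in for apply(pd.Series).stack(); the value_counts step is unchanged.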
EDIT:
If you need the top 5 with duplicates removed within each row, use drop_duplicates:
print(df)
a
0 [hhcb, hhcb, cbcc, bccc, bbbb, hhcb, hhcb]
1 [fahb, ahba, hhcd, fahb]
2 [hcbh, hhcd, hcbh]
3 [hhhh, hhhh, hhhc, hhcd, hhcb, bbbb]
4 [habb, habb, bbbb, bbbc, cbcc, bccd, ccdh, cdhd]
df1 = (df.a.apply(pd.Series)
         .stack()
         .groupby(level=0)
         .apply(lambda x: x.drop_duplicates())
         .value_counts()[:5])
print(df1)
bbbb 3
hhcd 3
hhcb 2
cbcc 2
habb 1
dtype: int64
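The counts above can be cross-checked in plain Python: turning each row into a set drops the within-row repeats before counting. This is a sketch that swaps in collections.Counter for the pandas chain, with the rows hard-coded from the example (the names rows, row_counts and top5 are illustrative only):

```python
from collections import Counter

# Rows copied from the second example above.
rows = [
    ['hhcb', 'hhcb', 'cbcc', 'bccc', 'bbbb', 'hhcb', 'hhcb'],
    ['fahb', 'ahba', 'hhcd', 'fahb'],
    ['hcbh', 'hhcd', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hhcb', 'bbbb'],
    ['habb', 'habb', 'bbbb', 'bbbc', 'cbcc', 'bccd', 'ccdh', 'cdhd'],
]

row_counts = Counter()
for row in rows:
    # set(row) means each string is counted at most once per row.
    row_counts.update(set(row))

# Five most frequent strings; the order among tied counts is not guaranteed.
top5 = row_counts.most_common(5)
print(top5)
```

The result is the number of rows each string appears in, rather than its total number of occurrences.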
Upvotes: 2
Reputation: 109626
You can use Counter, updating it for each word of length four. Then use most_common() to get the top values.
from collections import Counter
c = Counter()
for row in df.ngram.values:
    for word in row:
        if len(word) == 4:
            c.update([word])
>>> c.most_common()[0]
('hhhh', 2)
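Since the question says the lists sit in a CSV file, they will usually arrive as strings rather than as Python lists. A hedged sketch, assuming every line of the file is a Python list literal exactly as shown in the question; the in-memory raw text stands in for the real file, and count_ngrams is a hypothetical helper name:

```python
import ast
import io
from collections import Counter

# Stand-in for the real file: one list literal per line, as in the question.
raw = """['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']
['fahb', 'ahba', 'hbac', 'bacc']
['hchc', 'chcb', 'hcbh']
['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']
['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']
"""

def count_ngrams(lines, length=4):
    """Count strings of the given length across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # literal_eval safely turns the text "['a', 'b']" into a real list.
        words = ast.literal_eval(line)
        counts.update(w for w in words if len(w) == length)
    return counts

c = count_ngrams(io.StringIO(raw))
print(c.most_common(1))  # [('hhhh', 2)]
```

For a real file, an open file handle can be passed instead of io.StringIO, so the data is read one line at a time. If the file has a header row or quoted cells, the parsing step would need adjusting.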
Timings
%%timeit
for row in df.ngram.values:
    for word in row:
        if len(word) == 4:
            c.update([word])
10000 loops, best of 3: 87.7 µs per loop
%%timeit
df.ngram.apply(pd.Series).stack().value_counts().head(1)
100 loops, best of 3: 2.4 ms per loop
%timeit pd.Series(df.ngram.sum()).value_counts().index[0]
1000 loops, best of 3: 474 µs per loop
Upvotes: 2
Reputation: 76947
Here's one way.
In [78]: ngram
Out[78]:
0 [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1 [fahb, ahba, hbac, bacc]
2 [hchc, chcb, hcbh]
3 [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4 [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]
dtype: object
In [79]: pd.Series(ngram.sum()).value_counts()[:1]
Out[79]:
hhhh 2
dtype: int64
This kind of cheats by using the .sum() operation, which concatenates the lists into one.
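Because adding two lists copies both of them, .sum() over many rows can get slow on large data. One alternative, offered as a sketch rather than as part of this answer, is itertools.chain.from_iterable, which walks the rows without building intermediate lists. The rows variable below is a plain list of lists standing in for the ngram Series:

```python
from collections import Counter
from itertools import chain

# Plain list of lists standing in for the ngram Series above.
rows = [
    ['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh'],
    ['fahb', 'ahba', 'hbac', 'bacc'],
    ['hchc', 'chcb', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc'],
    ['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd'],
]

# chain.from_iterable yields the items of each row in turn, lazily.
flat = chain.from_iterable(rows)

most_common = Counter(flat).most_common(1)
print(most_common)  # [('hhhh', 2)]
```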
Upvotes: 1