Nishant ranjan

Reputation: 93

Most occurring string in data using pandas - n-gram data mining

I have data in the following form, stored in a single column of a CSV file.

['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']
['fahb', 'ahba', 'hbac', 'bacc']
['hchc', 'chcb', 'hcbh']
['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']
['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']

I need to find the most frequently occurring four-character string in this data. Please suggest a way. (This is a small example; the original data is large.)

Upvotes: 1

Views: 442

Answers (3)

jezrael

Reputation: 863176

You can apply Series to create a DataFrame, then use stack and value_counts. Finally, you can filter the top values with head or [:5]:

print df
                                                  a
0        [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1                          [fahb, ahba, hbac, bacc]
2                                [hchc, chcb, hcbh]
3              [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4  [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]

print df.a.apply(pd.Series).stack().value_counts()[:1]
hhhh    2
dtype: int64
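On newer pandas versions (0.25+), the same result can be obtained with explode, which flattens the lists directly without building an intermediate DataFrame. A minimal sketch, assuming the same column a as in the example above:

```python
import pandas as pd

# Sample data matching the example above
df = pd.DataFrame({'a': [
    ['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh'],
    ['fahb', 'ahba', 'hbac', 'bacc'],
    ['hchc', 'chcb', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc'],
    ['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd'],
]})

# explode turns each list element into its own row; value_counts tallies them
counts = df['a'].explode().value_counts()
print(counts.head(1))  # hhhh    2
```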

EDIT:

If you need the top 5 with duplicates removed within each row, use drop_duplicates:

print df
                                                  a
0        [hhcb, hhcb, cbcc, bccc, bbbb, hhcb, hhcb]
1                          [fahb, ahba, hhcd, fahb]
2                                [hcbh, hhcd, hcbh]
3              [hhhh, hhhh, hhhc, hhcd, hhcb, bbbb]
4  [habb, habb, bbbb, bbbc, cbcc, bccd, ccdh, cdhd]

df1 = (df.a.apply(pd.Series)
           .stack()
           .groupby(level=0)
           .apply(lambda x: x.drop_duplicates())
           .value_counts()[:5])

print df1
bbbb    3
hhcd    3
hhcb    2
cbcc    2
habb    1
dtype: int64
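With newer pandas, the per-row deduplication can also be sketched by passing each list through set before exploding (again assuming the column name a from the example):

```python
import pandas as pd

# Sample data matching the EDIT example above
df = pd.DataFrame({'a': [
    ['hhcb', 'hhcb', 'cbcc', 'bccc', 'bbbb', 'hhcb', 'hhcb'],
    ['fahb', 'ahba', 'hhcd', 'fahb'],
    ['hcbh', 'hhcd', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hhcb', 'bbbb'],
    ['habb', 'habb', 'bbbb', 'bbbc', 'cbcc', 'bccd', 'ccdh', 'cdhd'],
]})

# Deduplicate within each row first, then count across rows
counts = df['a'].apply(lambda row: list(set(row))).explode().value_counts()
print(counts[:5])
```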

Upvotes: 2

Alexander

Reputation: 109626

You can use Counter, updating it for each word that is of length four. Then use most_common() to get the top values.

from collections import Counter

c = Counter()
for row in df.ngram.values:
    for word in row:
        if len(word) == 4:
            c.update([word])

>>> c.most_common()[0]
('hhhh', 2)
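The same count can be written more compactly by feeding a filtered generator straight to the Counter constructor. A sketch using a plain list of rows as a stand-in for the df.ngram column:

```python
from collections import Counter

# Stand-in for the df.ngram column used above
rows = [
    ['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh'],
    ['fahb', 'ahba', 'hbac', 'bacc'],
    ['hchc', 'chcb', 'hcbh'],
    ['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc'],
    ['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd'],
]

# Counter accepts any iterable; the generator keeps only four-character strings
c = Counter(word for row in rows for word in row if len(word) == 4)
print(c.most_common(1))  # [('hhhh', 2)]
```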

Timings

%%timeit
for row in df.ngram.values:
    for word in row:
        if len(word) == 4:
            c.update([word])
10000 loops, best of 3: 87.7 µs per loop

%%timeit
df.ngram.apply(pd.Series).stack().value_counts().head(1)
100 loops, best of 3: 2.4 ms per loop

%timeit pd.Series(df.ngram.sum()).value_counts().index[0]
1000 loops, best of 3: 474 µs per loop

Upvotes: 2

Zero

Reputation: 76947

Here's one way.

In [78]: ngram
Out[78]:
0          [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1                            [fahb, ahba, hbac, bacc]
2                                  [hchc, chcb, hcbh]
3                [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4    [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]
dtype: object

In [79]: pd.Series(ngram.sum()).value_counts()[:1]
Out[79]:
hhhh    2
dtype: int64

This kind of cheats with the .sum() operation, which concatenates the lists into one.
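As a tiny illustration of that trick (with made-up lists), sum concatenates the row lists into one flat list, because adding two Python lists appends them:

```python
import pandas as pd

s = pd.Series([['aaaa', 'bbbb'], ['bbbb', 'cccc']])
flat = s.sum()          # lists are joined: ['aaaa', 'bbbb', 'bbbb', 'cccc']
counts = pd.Series(flat).value_counts()
print(counts[:1])       # bbbb    2
```

Note that this repeated list concatenation is quadratic in the total number of elements, so it can be slow on very large data.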

Upvotes: 1
