user11646543
user11646543

Reputation: 49

Itertools Groupby gives an Unexpected Result

I have two Lists

say

finalblobfpost1=['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']

With Same Dates for "K1_SS_ALM"

finalblobfpost2=['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt']

With different Dates from "K1_SS_ALM"

i need to groupby with K1_SS_ALM and K1_AB_KIL (re.findall("\w+/\w+/\d+/(.*?)_\d+_\d+.txt", text))

Mycode so far:

finalblobfpost1=['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']
keyf = lambda text: (re.findall("\w+\/\w+\/\d+\/(.*?)\_\d+_\d+.txt", text)+ [text])[0].strip()
h=[list(items) for gr, items in groupby(sorted(finalblobfpost1), key=keyf)]
print(h)

Result is -Good-Enough-Expected

[['ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt', 'ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt'], ['ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt',
'ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt']]

Code:2

finalblobfpost2=['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt','ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt']
keyf1 = lambda text: (re.findall("\w+\/\w+\/\d+\/(.*?)\_\d+_\d+.txt", text)+ [text])[0].strip()
h1=[list(items) for gr, items in groupby(sorted(finalblobfpost2), key=keyf1)]
print(h1)

the result is: Not Expected

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt'], ['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt'], ['ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'], ['ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt']]

Expected is:

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt','ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'],['ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt','ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt']]

It didn't group the keywords. Is anything wrong with regex or i am doing anything wrong?

Kindly Advise please.

Upvotes: 1

Views: 55

Answers (2)

Adam.Er8
Adam.Er8

Reputation: 13393

Your list needs to be sorted by the same key function as used in groupby!

try this:

h1=[list(items) for gr, items in groupby(sorted(finalblobfpost2, key=keyf1), key=keyf1)]

Only difference is the key=keyf1 in the call to sorted

Output (same as expected):

[['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt', 'ABC/XYZ/16082020/K1_AB_KIL_333333_16082020.txt'], ['ABC/XYZ/15082020/K1_SS_ALM_222222_15082020.txt', 'ABC/XYZ/16082020/K1_SS_ALM_111111_16082020.txt']]

This is explicitly written in the docs for groupby:

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function).

Upvotes: 1

sushanth
sushanth

Reputation: 8302

try this,

Regex Demo

import re
from itertools import groupby

print(
    [list(v) for _, v in groupby(finalblobfpost1,
                                 key=lambda x: re.search("\w\d+_\w{2}_\w{3}", x).group())]
)

[['ABC/XYZ/16082020/K1_SS_ALM_222222_14082020.txt', 'ABC/XYZ/16082020/K1_SS_ALM_111111_14082020.txt'], ['ABC/XYZ/15082020/K1_AB_KIL_444444_15082020.txt', 'ABC/XYZ/15082020/K1_AB_KIL_333333_15082020.txt']]

Upvotes: 0

Related Questions