Sattyaki
Sattyaki

Reputation: 406

How to identify DataFrame columns containing specific pattern

I want to identify all the columns that contains only a specific pattern, which is 32 characters long hex code. So if I have the following DataFrame for example

Incident                                SLA Contributor                         Priority       Last Group
-----------------------------------------------------------------------------------------------------------------------------
0000329edb523708f8079044db9619f0        DAP/IBM                                    P2        3a03ba3a6f22f2c44484f5269f3ee425
3bb02e66db97a70cf8079044db961907        EUX/SD                                     P4        102a3748dbee2200d4c17868bf9619b1
3bb02e66db97a70cf8079044db961907        fa863d3fdb1d3740205a328c7c961982           P4        102a3748dbee2200d4c17868bf9619b1
82cf9812dbd80098b6629c4adb96198d        DAP/IBM                                    P4        3a03ba3a6f22f2c44484f5269f3ee425
9e12a6c11b228c14de583150cd4bcbbb        3a03ba3a6f22f2c44484f5269f3ee425           P4        64f4d44fdbf666004f9a7cbdae961919
b17c3f43db9bf700b6629c4adb9619ca        NET/BT                                     P3        64f4d44fdbf666004f9a7cbdae961919
fa863d3fdb1d3740205a328c7c961982        DAP/IBM                                    P2        3a03ba3a6f22f2c44484f5269f3ee425

So it should identify the Incident and Last Group column, because all the values in these two columns are 32 characters long hex code. Some values of SLA Contributor column are 32 long hex code but there are other values also which does not follow that pattern, so it should not include SLA Contributor column.

I am not sure if I have to loop through all the values of a column to check, or there is any better solution for it. I would really appreciate if you could point me in the right direction.

Thanks, Sattyaki

Upvotes: 0

Views: 84

Answers (2)

jezrael
jezrael

Reputation: 863791

For more general solution first select only strings columns by DataFrame.select_dtypes.

Use Series.str.len for test length per columns and then test if all values are Trues by DataFrame.all, last filter columns names:

df1 = df.select_dtypes(object)

mask = df1.apply(lambda x: x.str.len().eq(32)).all()

If need test hexadecimal 32 length string use Series.str.contains with na parameter with True:

mask = df1.apply(lambda x: x.str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()

cols = df1.columns[mask].tolist()
print (cols)
['Incident', 'Last Group']

List comprehension alternative is:

cols = [c for c in df.columns if df[c].str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()]
print (cols)
['Incident', 'Last Group']

Upvotes: 1

PieCot
PieCot

Reputation: 3639

You can use all and a regular expression:

hex_regex = re.compile(r"^[0-9a-fA-F]{32}$")  # regular expression of an hex code of 32 digits

hex_cols = [col for col in df.columns if df[col].map(lambda x: hex_regex.match(x) is not None).all()]  # check that all the entries in a column match the required format

If nan are allowed, you should skip them, so use the following:

hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]

In both cases, hex_cols is ['Incident', 'Last Group'].

Here a full example (with nan values):

import pandas as pd
import re

df = pd.DataFrame(
    data=[
        ['0000329edb523708f8079044db9619f0', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['3bb02e66db97a70cf8079044db961907', 'EUX/SD', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
        ['3bb02e66db97a70cf8079044db961907', 'fa863d3fdb1d3740205a328c7c961982', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
        ['82cf9812dbd80098b6629c4adb96198d', 'DAP/IBM', 'P4', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['9e12a6c11b228c14de583150cd4bcbbb', '3a03ba3a6f22f2c44484f5269f3ee425', 'P4', '64f4d44fdbf666004f9a7cbdae961919'],
        ['b17c3f43db9bf700b6629c4adb9619ca', 'NET/BT', 'P3', '64f4d44fdbf666004f9a7cbdae961919'],
        ['fa863d3fdb1d3740205a328c7c961982', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        [None, 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['b17c3f43db9bf700b6629c4adb9619ce', 'DAP/IBM', 'P2', None],
        [None, 'DAP/IBM', 'P2', None],
    ],
    columns=['Incident', 'SLA Contributor', 'Priority', 'Last Group'],
)

hex_regex = re.compile(r"^[0-9a-fA-F]{32}$")

hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]

If you like one-liner, you can compress the previous code:

[col for col in df.columns if df[df[col].notnull()][col].map(lambda x: re.compile(r"^[0-9a-fA-F]{32}$").match(x) is not None).all()]

Upvotes: 3

Related Questions