Reputation: 406
I want to identify all the columns that contains only a specific pattern, which is 32 characters long hex code. So if I have the following DataFrame for example
Incident SLA Contributor Priority Last Group
-----------------------------------------------------------------------------------------------------------------------------
0000329edb523708f8079044db9619f0 DAP/IBM P2 3a03ba3a6f22f2c44484f5269f3ee425
3bb02e66db97a70cf8079044db961907 EUX/SD P4 102a3748dbee2200d4c17868bf9619b1
3bb02e66db97a70cf8079044db961907 fa863d3fdb1d3740205a328c7c961982 P4 102a3748dbee2200d4c17868bf9619b1
82cf9812dbd80098b6629c4adb96198d DAP/IBM P4 3a03ba3a6f22f2c44484f5269f3ee425
9e12a6c11b228c14de583150cd4bcbbb 3a03ba3a6f22f2c44484f5269f3ee425 P4 64f4d44fdbf666004f9a7cbdae961919
b17c3f43db9bf700b6629c4adb9619ca NET/BT P3 64f4d44fdbf666004f9a7cbdae961919
fa863d3fdb1d3740205a328c7c961982 DAP/IBM P2 3a03ba3a6f22f2c44484f5269f3ee425
So it should identify the Incident
and Last Group
column, because all the values in these two columns are 32 characters long hex code. Some values of SLA Contributor
column are 32 long hex code but there are other values also which does not follow that pattern, so it should not include SLA Contributor
column.
I am not sure if I have to loop through all the values of a column to check, or there is any better solution for it. I would really appreciate if you could point me in the right direction.
Thanks, Sattyaki
Upvotes: 0
Views: 84
Reputation: 863791
For more general solution first select only strings columns by DataFrame.select_dtypes
.
Use Series.str.len
for test length per columns and then test if all values are True
s by DataFrame.all
, last filter columns names:
df1 = df.select_dtypes(object)
mask = df1.apply(lambda x: x.str.len().eq(32)).all()
If need test hexadecimal 32 length string use Series.str.contains
with na
parameter with True
:
mask = df1.apply(lambda x: x.str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()
cols = df1.columns[mask].tolist()
print (cols)
['Incident', 'Last Group']
List comprehension alternative is:
cols = [c for c in df.columns if df[c].str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()]
print (cols)
['Incident', 'Last Group']
Upvotes: 1
Reputation: 3639
You can use all
and a regular expression:
hex_regex = re.compile(r"^[0-9a-fA-F]{32}$") # regular expression of an hex code of 32 digits
hex_cols = [col for col in df.columns if df[col].map(lambda x: hex_regex.match(x) is not None).all()] # check that all the entries in a column match the required format
If nan are allowed, you should skip them, so use the following:
hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]
In both cases, hex_cols
is ['Incident', 'Last Group']
.
Here a full example (with nan values):
import pandas as pd
import re
df = pd.DataFrame(
data=[
['0000329edb523708f8079044db9619f0', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
['3bb02e66db97a70cf8079044db961907', 'EUX/SD', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
['3bb02e66db97a70cf8079044db961907', 'fa863d3fdb1d3740205a328c7c961982', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
['82cf9812dbd80098b6629c4adb96198d', 'DAP/IBM', 'P4', '3a03ba3a6f22f2c44484f5269f3ee425'],
['9e12a6c11b228c14de583150cd4bcbbb', '3a03ba3a6f22f2c44484f5269f3ee425', 'P4', '64f4d44fdbf666004f9a7cbdae961919'],
['b17c3f43db9bf700b6629c4adb9619ca', 'NET/BT', 'P3', '64f4d44fdbf666004f9a7cbdae961919'],
['fa863d3fdb1d3740205a328c7c961982', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
[None, 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
['b17c3f43db9bf700b6629c4adb9619ce', 'DAP/IBM', 'P2', None],
[None, 'DAP/IBM', 'P2', None],
],
columns=['Incident', 'SLA Contributor', 'Priority', 'Last Group'],
)
hex_regex = re.compile(r"^[0-9a-fA-F]{32}$")
hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]
If you like one-liner, you can compress the previous code:
[col for col in df.columns if df[df[col].notnull()][col].map(lambda x: re.compile(r"^[0-9a-fA-F]{32}$").match(x) is not None).all()]
Upvotes: 3