merit_2
merit_2

Reputation: 471

detect pandas column names if similiar match and process

I have data that gets imported into a Pandas dataframe where that elements that are lists are being separated into new columns automatically. My data was originally .root files and I am importing them in to Pandas with Uproot

Below is example data where the column physics[0] and physics2 were originally elements of a list

data = {'physics[0]': [1,2,3], 'physics[1]': [4,5,6], 'yes': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)



   physics[0]  physics[1]  yes  no 
0           1           4    7  10  
1           2           5    8  11  
2           3           6    9  12  

I am trying to come up with a technique to detect similar column names and recreate the elements as list. Here is what I have so far:

lst = [col for col in df.columns if 'physics' in col]

df['physics']=df[lst].values.tolist()

    yes  no physics
0    7  10  [1, 4]
1    8  11  [2, 5]
2    9  12  [3, 6]

which works. I won't always know before hand what the column names will be when this happens. But I'd like to be able to detect if the names are similar automatically and perform the above list comprehension.

Upvotes: 1

Views: 695

Answers (3)

Stepan
Stepan

Reputation: 1054

You can generalize your approach using regex:

import re
# create dictionary d of all groups of similar columns
multi_cols = filter(lambda x: re.search(r'\[[0-9]+\]$',x),df.columns)
d = {}
for c in multi_cols:
    k = re.sub(r'\[[0-9]+\]$', '' , str(c))
    if k not in d:
        d[k] = []
    d[k].append(c)

# the dictionary will be as following:
print(d)
# {'physics': ['physics[0]', 'physics[1]']}

# use dictionary d to combine all similar columns in each group
for k in d:
    df[k] = df[d[k]].values.tolist()

Upvotes: 2

David
David

Reputation: 126

Can we assume that any duplicate columns will always contain [0] at the end ?

what about something like this -

data = {'physics[0]': [1,2,3], 'physics[1]': [4,5,6], 'yes': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

duplicates = set([])
columns = df.columns
for c in columns:
    if c.endswith('[0]') and c.replace('[0]', '') not in duplicates:
        duplicates.add(c.replace('[0]', ''))

for d in duplicates:
    lst = [col for col in df.columns if d in col]
    df[d]=df[lst].values.tolist()

Upvotes: 0

Michael B
Michael B

Reputation: 578

It may be worth checking out difflib. You can create l1 and l2 lists from your column headers, then utilize difflib's matching:

l1 = list(df.columns.values)
l2 = list(df.columns.values)

import difflib
difflib.get_close_matches(l1, l2)

Upvotes: 0

Related Questions