Cyber_Tron
Cyber_Tron

Reputation: 299

Generate hash table comprising of 4 string keys/numeric values in Pandas

I have a CSV file as, which I am reading into a dataframe,

ZoneMaterialName1,ZoneThickness1,ZoneMaterialName2,ZoneThickness2,ZoneMaterialName3,ZoneThickness3,ZoneMaterialName4,ZoneThickness4
Copper,2.5,Silver,5,Gold,12,Selenium,6
Copper,2.5,Silver,5,Gold,12,Selenium,6
Copper,2,Silver,8,Gold,2,Selenium,3
Aluminium,3,Sodium,14,,,Titanium,5
Aluminium,13,Sodium,5,,,Titanium,15

I want to generate a hash table with the key comprising of unique 4 ZoneMaterialName Fields and the values being their respective ZoneThickness numeric fields.

Example of desired output:

Copper,Silver,Gold,Selenium:[[2.5,5,12,6],[2,8,2,3]]
Aluminium,Sodium,,Titanium:[[3,14,,5],[13,5,,15]]

If the four values for a unique key appear again, they are ignored and only the unique ones are taken.

Occasionally, there may be some empty fields in each row but that should also be considered in the key and value of the hash table as shown above.

I am unable to do this efficiently,

import pandas as pd
import numpy as np
df = pd.read_csv('/mnt/c/python_test/Materials.csv')
myfilter = ~df.ZoneMaterialName1.duplicated(keep='first') & \
           ~df.ZoneMaterialName2.duplicated(keep='first') & \
           ~df.ZoneMaterialName3.duplicated(keep='first') & \
           ~df.ZoneMaterialName34.duplicated(keep='first') & \
df.ix[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
print df

I'm new to pandas so any help/guidance is appreciated!

Upvotes: 0

Views: 69

Answers (1)

Chris
Chris

Reputation: 16147

import pandas as pd
import numpy as np
df = pd.read_csv('/mnt/c/python_test/Materials.csv')

# replace nan with 'NA' for material names so they are not excluded from groupby
df[['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4']] =df[['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4']].fillna('NA')

# Get List of thickness all values for each row
df['combined'] = df.apply(lambda row: [row['ZoneThickness1'],row['ZoneThickness2'],row['ZoneThickness3'],row['ZoneThickness4']], axis=1)

# Groupby target columns, making a list of lists of thicknesses
df.groupby(['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4'])['combined'].apply(list)

# Get rid of duplicates
df['combined'] = df['combined'].apply(lambda x: set(tuple(i) for i in x))

Output

 ZoneMaterialName1 ZoneMaterialName2 ZoneMaterialName3 ZoneMaterialName4                                 combined
0         Aluminium            Sodium                NA          Titanium  {(3.0, 14, nan, 5), (13.0, 5, nan, 15)}
1            Copper            Silver              Gold          Selenium    {(2.0, 8, 2.0, 3), (2.5, 5, 12.0, 6)}

Upvotes: 1

Related Questions