Reputation: 299
I have a CSV file as, which I am reading into a dataframe,
ZoneMaterialName1,ZoneThickness1,ZoneMaterialName2,ZoneThickness2,ZoneMaterialName3,ZoneThickness3,ZoneMaterialName4,ZoneThickness4
Copper,2.5,Silver,5,Gold,12,Selenium,6
Copper,2.5,Silver,5,Gold,12,Selenium,6
Copper,2,Silver,8,Gold,2,Selenium,3
Aluminium,3,Sodium,14,,,Titanium,5
Aluminium,13,Sodium,5,,,Titanium,15
I want to generate a hash table with the key comprising of unique 4 ZoneMaterialName Fields and the values being their respective ZoneThickness numeric fields.
Example of desired output:
Copper,Silver,Gold,Selenium:[[2.5,5,12,6],[2,8,2,3]]
Aluminium,Sodium,,Titanium:[[3,14,,5],[13,5,,15]]
If the four values for a unique key appear again, they are ignored and only the unique ones are taken.
Occasionally, there may be some empty fields in each row but that should also be considered in the key and value of the hash table as shown above.
I am unable to do this efficiently,
import pandas as pd
import numpy as np
df = pd.read_csv('/mnt/c/python_test/Materials.csv')
myfilter = ~df.ZoneMaterialName1.duplicated(keep='first') & \
~df.ZoneMaterialName2.duplicated(keep='first') & \
~df.ZoneMaterialName3.duplicated(keep='first') & \
~df.ZoneMaterialName34.duplicated(keep='first') & \
df.ix[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
print df
I'm new to pandas so any help/guidance is appreciated!
Upvotes: 0
Views: 69
Reputation: 16147
import pandas as pd
import numpy as np
df = pd.read_csv('/mnt/c/python_test/Materials.csv')
# replace nan with 'NA' for material names so they are not excluded from groupby
df[['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4']] =df[['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4']].fillna('NA')
# Get List of thickness all values for each row
df['combined'] = df.apply(lambda row: [row['ZoneThickness1'],row['ZoneThickness2'],row['ZoneThickness3'],row['ZoneThickness4']], axis=1)
# Groupby target columns, making a list of lists of thicknesses
df.groupby(['ZoneMaterialName1','ZoneMaterialName2','ZoneMaterialName3','ZoneMaterialName4'])['combined'].apply(list)
# Get rid of duplicates
df['combined'] = df['combined'].apply(lambda x: set(tuple(i) for i in x))
ZoneMaterialName1 ZoneMaterialName2 ZoneMaterialName3 ZoneMaterialName4 combined
0 Aluminium Sodium NA Titanium {(3.0, 14, nan, 5), (13.0, 5, nan, 15)}
1 Copper Silver Gold Selenium {(2.0, 8, 2.0, 3), (2.5, 5, 12.0, 6)}
Upvotes: 1