Reputation: 755
I have a huge dataset with contents such as given below:
+------+------------------------------------------------------------------+----------------------------------+--+
| HHID | VAL_CD64 | VAL_CD32 | |
+------+------------------------------------------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 | |
| 203 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | 6f3606577eadacef1b956307558a1efd | |
| 203 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b | |
| 720 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 | |
| 348 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 | |
| 348 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 | |
+------+------------------------------------------------------------------+----------------------------------+--+
HHID can be present in the file at a maximum of three times. If HHID is found once, then the VAL_CD64/VAL_CD32 should be moved to VAL1_CD64/VAL1_CD32 columns, if found 2nd time, second value should be moved to VAL2_CD64/VAL2_CD32 columns, and if found 3rd time, third value should be moved to VAL3_CD64/VAL3_CD32 columns. If value is not found, then these columns should be left blank.
Output should look something like this:
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| HHID | VAL1_CD64 | VAL2_CD64 | VAL3_CD64 | VAL1_CD32 | VAL2_CD32 | VAL3_CD32 | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 373aeda34c0b4ab91a02ecf55af58e15 | 6f3606577eadacef1b956307558a1efd | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | | 2c4f97a04f02db5a36a85f48dab39b5b | 2111293e946703652070968b224875c9 | | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 5c80a555fcda02d028fc60afa29c4a40 | 6c10cd11b805fa57d2ca36df91654576 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | | | 71a91c4768bd314f3c9dc74e9c7937e8 | | | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
I tried using pivot/melt in pandas but unable to get an idea to implement it. Can anyone help in giving me a lead?
Thanks
Upvotes: 0
Views: 55
Reputation: 30002
One possible way is to combine VAL_CD32
and VAL_CD64
into list then split those list into columns:
df_ = df.groupby('HHID').agg({'VAL_CD32': list, 'VAL_CD64': list})
data = []
for col in df_.columns:
d = pd.DataFrame(df_[col].values.tolist(), index=df_.index)
d.columns = [f'{col}_{i}' for i in map(str, range(1, len(d.columns)+1))]
data.append(d)
res = pd.concat(data, axis=1)
VAL_CD32_1 VAL_CD32_2 \
HHID
203 373aeda34c0b4ab91a02ecf55af58e15 6f3606577eadacef1b956307558a1efd
348 5c80a555fcda02d028fc60afa29c4a40 6c10cd11b805fa57d2ca36df91654576
403 71a91c4768bd314f3c9dc74e9c7937e8 None
720 2c4f97a04f02db5a36a85f48dab39b5b 2111293e946703652070968b224875c9
VAL_CD32_3 \
HHID
203 332321ab150879e930869c15b1d10c83
348 6040b29107adf1a41c4f5964e0ff6dcb
403 None
720 None
VAL_CD64_1 \
HHID
203 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5
348 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496
403 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636
720 f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751
VAL_CD64_2 \
HHID
203 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8
348 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9
403 None
720 abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3
VAL_CD64_3
HHID
203 a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7
348 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37
403 None
720 None
Upvotes: 3