Reputation: 9
Is there an easy way to load a JSON file with the following structure:
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key1', 'value1', 'col6_1')
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key2', 'value2', 'col6_1')
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key3', 'value3', 'col6_1')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key1', 'value1', 'col6_2')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key2', 'value2', 'col6_2')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key3', 'value3', 'col6_2')
to achieve:
('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key1', 'key2', 'key3', 'col6_1')
('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key1', 'key2', 'key3', 'col6_2')
with value1, value2, value3 assigned to the key1, key2, key3 columns accordingly?
I would like to use pandas or PySpark functions.
Upvotes: 0
Views: 48
Reputation: 31
This file structure is not valid JSON, but you can use DataFrame.drop_duplicates()
to drop the duplicate rows:
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Keep only the first row for each brand and renumber the index.
df.drop_duplicates(subset=['brand'], keep='first', inplace=True, ignore_index=True)
print(df)
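If the goal is the wide layout shown in the question (one row per ID, with the keys turned into columns), that is a long-to-wide pivot rather than deduplication. A minimal pandas sketch, using hypothetical column names (`id`, `col1` … `col6`, `key`, `value`) since the real schema isn't shown:

```python
import pandas as pd

# Hypothetical rows mirroring the structure in the question.
rows = [
    ('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key1', 'value1', 'col6_1'),
    ('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key2', 'value2', 'col6_1'),
    ('ID_1', 'col1_1', 'col2_1', 'col3_1', 'key3', 'value3', 'col6_1'),
    ('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key1', 'value1', 'col6_2'),
    ('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key2', 'value2', 'col6_2'),
    ('ID_2', 'col1_2', 'col2_2', 'col3_2', 'key3', 'value3', 'col6_2'),
]
df = pd.DataFrame(rows, columns=['id', 'col1', 'col2', 'col3', 'key', 'value', 'col6'])

# Pivot: each distinct key becomes a column, filled with its value.
# The columns that are constant per ID go into the index so they
# survive the reshape, then reset_index() flattens them back out.
wide = df.pivot_table(index=['id', 'col1', 'col2', 'col3', 'col6'],
                      columns='key', values='value',
                      aggfunc='first').reset_index()
print(wide)
```

`aggfunc='first'` is needed because `pivot_table` aggregates by default and the values here are strings; if each (ID, key) pair is guaranteed unique, `df.pivot()` with the same `index`/`columns`/`values` arguments works too.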
Upvotes: 0