Reputation: 3423
I have been given a lot of JSON files that have the following format.
{
"y":[
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,0,0,15866,15866,15866,16869,17116,17400,17412],
[53,3253,3253,3253,3253,3253,3253,3253,3249,3249],
[0,0,0,0,0,0,0,0,0,0],
[342,16342,16342,16342,16342,16342,16342,16342,16342,16342],
[13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
[0,0,0,25,81,12,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0],
[0,2193,2175,2175,4050,4059,4059,4089,4079,3695],
[4,0,0,0,0,0,0,77,0,0],
[0,75,75,75,78,78,78,734,732,732]
],
"labels":[
"Developer 1",
"Developer 10",
"Developer 2",
"Developer 3",
"Developer 4",
"Developer 11",
"Developer 5",
"Developer 6",
"Developer 7",
"Developer 12",
"Developer 8",
"Developer 6",
"Developer 7"
]
}
The data elements in y have the same index as the corresponding label in labels. The problem I have is that sometimes the same label appears twice. In this example, Developer 6 appears at indexes 7 and 11 and Developer 7 appears at indexes 8 and 12.
I'd like to merge the data for the duplicates. I can do this by just adding the items in the lists for the duplicate record. Example for Developer 6.
The duplicate data rows are:
[13427,14033,14606,115822,120711,121270,125757,145946,150498,150634],
[4,0,0,0,0,0,0,77,0,0],
A merged record would be:
[13431,14033,14606,115822,120711,121270,125757,146023,150498,150634],
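The element-wise addition described above can be sketched in plain Python with zip. The names row_a, row_b and merged are only illustrative, and the sketch assumes both rows have the same length.

```python
# The two duplicate rows for Developer 6, copied from the file above.
row_a = [13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634]
row_b = [4, 0, 0, 0, 0, 0, 0, 77, 0, 0]

# Pair the values position by position and add each pair.
merged = [x + y for x, y in zip(row_a, row_b)]
print(merged)
```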
This is where I get stuck though. I want to remove one of the old rows AND the duplicate label. Then I need to be able to repeat the process for any other duplicate labels, but at this point I've messed up indexes.
How can I merge the duplicate data rows, remove the duplicate labels and do this for any and all duplicate labels in my file?
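One possible way around the shifting-index problem is to walk the labels from the last index to the first, so that a deletion only moves entries that have already been handled. The sketch below is a hypothetical helper (merge_in_place is not part of any library) and assumes that rows sharing a label have the same length.

```python
def merge_in_place(data):
    """Fold each later duplicate row into its first occurrence, in place."""
    labels, rows = data["labels"], data["y"]
    # Walk from the last index to the first so that deleting an entry
    # never shifts the position of an entry that is still to be visited.
    for i in range(len(labels) - 1, -1, -1):
        first = labels.index(labels[i])  # index of the first occurrence
        if first != i:
            # Add the duplicate row to the first row, then drop the duplicate.
            rows[first] = [x + y for x, y in zip(rows[first], rows[i])]
            del rows[i]
            del labels[i]
    return data


# Small made-up input in the same shape as the files described above.
sample = {
    "y": [[1, 2], [10, 20], [100, 200], [5, 5]],
    "labels": ["A", "B", "A", "B"],
}
merge_in_place(sample)
print(sample)
```

Because labels.index scans the list each time, this is quadratic in the number of labels, which should be fine for files of this size but may be slow for very long label lists.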
Upvotes: 3
Views: 564
Reputation: 20669
You can try this.
import numpy as np
# a is assumed to be the dict parsed from one JSON file (for example with json.load)
out=list(zip(a['y'],a['labels']))
''' out looks like this
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 1')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 10')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 2')
([0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400, 17412], 'Developer 3')
([53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249], 'Developer 4')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 11')
([342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342], 'Developer 5')
([13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946, 150498, 150634], 'Developer 6')
([0, 0, 0, 25, 81, 12, 0, 0, 0, 0], 'Developer 7')
([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'Developer 12')
([0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695], 'Developer 8')
([4, 0, 0, 0, 0, 0, 0, 77, 0, 0], 'Developer 6')
([0, 75, 75, 75, 78, 78, 78, 734, 732, 732], 'Developer 7')'''
out=list(map(list,out))
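# convert each data row to a NumPy array so rows can be added element-wise
for i,val in enumerate(out):
    out[i][0]=np.array(val[0])
# group the rows by label
new_dict={}
for v,k in out:
    if not new_dict.get(k):
        new_dict[k]=[v]
    else:
        new_dict[k].append(v)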
''' new_dict looks like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])])
('Developer 4', [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])])
('Developer 6', [array([ 13427, 14033, 14606, 115822, 120711, 121270, 125757, 145946,
150498, 150634]), array([ 4, 0, 0, 0, 0, 0, 0, 77, 0, 0])])
('Developer 7', [array([ 0, 0, 0, 25, 81, 12, 0, 0, 0, 0]), array([ 0, 75, 75, 75, 78, 78, 78, 734, 732, 732])])
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''
temp=np.zeros(10) #each array corresponding to each developer is of size 10
for idx,i in enumerate(new_dict.items()):
    # only labels with more than one row need merging
    if len(i[1])>1:
        for l in i[1]:
            temp=temp+l
        new_dict.update({i[0]:temp})
        #print(temp)
        temp=np.zeros(10)
'''Now new_dict.items() will look like this
('Developer 1', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 10', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 2', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 3', [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])])
('Developer 4', [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])])
('Developer 11', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 5', [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])])
('Developer 6', array([ 13431., 14033., 14606., 115822., 120711., 121270., 125757.,
146023., 150498., 150634.]))
('Developer 7', array([ 0., 75., 75., 100., 159., 90., 78., 734., 732., 732.]))
('Developer 12', [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('Developer 8', [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])'''
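# items() yields (label, data) pairs, so a receives the labels and b the data;
# as written, 'y' therefore holds the labels and 'label' holds the data
a,b=zip(*new_dict.items())
res={'y':a,'label':b}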
res is what you need.
Output:
import pandas as pd
print(res)
df=pd.DataFrame(res)
print(df)
{'y': ('Developer 1', 'Developer 10', 'Developer 2', 'Developer 3', 'Developer 4',
'Developer 11', 'Developer 5', 'Developer 6', 'Developer 7', 'Developer 12', 'Developer 8'),
'label': ([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 0, 0, 0, 15866, 15866, 15866, 16869, 17116, 17400,
17412])], [array([ 53, 3253, 3253, 3253, 3253, 3253, 3253, 3253, 3249, 3249])], [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 342, 16342, 16342, 16342, 16342, 16342, 16342, 16342, 16342,
16342])], array([ 13431., 14033., 14606., 115822., 120711., 121270., 125757.,
146023., 150498., 150634.]), array([ 0., 75., 75., 100., 159., 90., 78., 734., 732., 732.]), [array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])], [array([ 0, 2193, 2175, 2175, 4050, 4059, 4059, 4089, 4079, 3695])])}
y label
0 Developer 1 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
1 Developer 10 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
2 Developer 2 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
3 Developer 3 [[0, 0, 0, 15866, 15866, 15866, 16869, 17116, ...
4 Developer 4 [[53, 3253, 3253, 3253, 3253, 3253, 3253, 3253...
5 Developer 11 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
6 Developer 5 [[342, 16342, 16342, 16342, 16342, 16342, 1634...
7 Developer 6 [13431.0, 14033.0, 14606.0, 115822.0, 120711.0...
8 Developer 7 [0.0, 75.0, 75.0, 100.0, 159.0, 90.0, 78.0, 73...
9 Developer 12 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
10 Developer 8 [[0, 2193, 2175, 2175, 4050, 4059, 4059, 4089,...
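For comparison, here is a minimal sketch that uses only the standard library, keeps the rows as plain integer lists, and returns the result under the same y and labels keys as the input files. The function name merge_duplicates and the tiny inline input are made up for illustration; it assumes rows sharing a label have the same length and relies on dicts preserving insertion order (Python 3.7+).

```python
import json


def merge_duplicates(data):
    """Return a new dict with one summed row per distinct label."""
    totals = {}  # label -> summed row, kept in first-seen order
    for label, row in zip(data["labels"], data["y"]):
        if label in totals:
            # Add this row to the running total for the label.
            totals[label] = [x + y for x, y in zip(totals[label], row)]
        else:
            totals[label] = list(row)  # copy so the input is left untouched
    return {"y": list(totals.values()), "labels": list(totals.keys())}


# Small made-up input in the same shape as the files in the question.
raw = '{"y": [[1, 2], [10, 20], [100, 200]], "labels": ["A", "B", "A"]}'
result = merge_duplicates(json.loads(raw))
print(json.dumps(result))
# prints {"y": [[101, 202], [10, 20]], "labels": ["A", "B"]}
```

Because a new dict is built instead of deleting entries from the original lists, no index bookkeeping is needed, and the same function can be applied to each file in turn.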
Upvotes: 1