Reputation: 55
I'm trying to create a python script to help me normalize my database. I've written up some example code.
I'm trying to check if dataArray exists in df1, under the names of the columns columArray
if it does exist, return the ID of the row (1 would return for both cases)
if it does not exist, add a new row and return the ID of the new row (3 would return for both)
In [df1, df2] I've got a few colums I don't care about scanning ('ID1','ID2','C','F')
Here's my current code.
import pandas as pd
df1=pd.DataFrame({
'ID1': ['0', '1', '2'],
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['Data does', 'not matter', '']})
df2=pd.DataFrame({
'ID2': ['0', '1', '2'],
'D': ['D0', 'D1', 'D2'],
'E': ['E0', 'E1', 'E2'],
'F': ['Data does', 'not matter', '']})
df_import = pd.DataFrame({
'A': ['A1', 'A4'],
'B': ['B1', 'B4'],
'C': ['C1', 'C4'],
'D': ['D1', 'D4'],
'E': ['E1', 'E4']
})
def n1(dataFrame,pkName,columArray,dataArray):
#if dataArray in df.columArray:
if 1!=1:
return('Win')
#return index/pkName (ID# '1')
else:
dataFrame.loc[len(dataFrame.index)] = [len(dataFrame.index), dataArray[0], dataArray[1], ""]
return(len(dataFrame.index)-1)
i=0
for index, row in df_import.iterrows():
pk_test1=n1(df1,'ID1',['A','B'],[row['A'], row['B']])
pk_test2=n1(df2,'ID2',['D','E'],[row['D'], row['E']])
print(pk_test1)
print(pk_test2)
i+=1
print(df1)
print()
print(df2)
print()
print(df_import)
and the output
3
3
4
4
ID1 A B C
0 0 A0 B0 Data does
1 1 A1 B1 not matter
2 2 A2 B2
3 3 A1 B1
4 4 A4 B4
ID2 D E F
0 0 D0 E0 Data does
1 1 D1 E1 not matter
2 2 D2 E2
3 3 D1 E1
4 4 D4 E4
A B C D E
0 A1 B1 C1 D1 E1
1 A4 B4 C4 D4 E4
I'd like the output to be
1
3
1
3
ID1 A B C
0 0 A0 B0 Data does
1 1 A1 B1 not matter
2 2 A2 B2
3 3 A4 B4
ID2 D E F
0 0 D0 E0 Data does
1 1 D1 E1 not matter
2 2 D2 E2
3 3 D4 E4
A B C D E
0 A1 B1 C1 D1 E1
1 A4 B4 C4 D4 E4
The part of this I'm particularly asking about is the commended out if
line, and the return on true.
Upvotes: 0
Views: 72
Reputation: 55
This function will add the dataArray to the dataFrame if not found in the subcolumns columArray, and return the value of the pkName of the dataFrame at the matching index.
def n1(dataFrame,pkName,columArray,dataArray):
match = (dataArray == dataFrame[columArray]).all(1)
if not match.any():
df_length = len(dataFrame.index)
dataFrame.loc[df_length] = [df_length] + dataArray + ['']
return df_length
return dataFrame[pkName][match.argmax()]
Edit: Removed duplicate cast of (dataArray == dataFrame[columArray]).all(1)
and len(dataFrame.index)
by adding variables to hold those, nearly doubling speed
Upvotes: 0
Reputation: 23773
I'm particularly asking about is the commended out if line
>>> df1
ID1 A B C
0 0 A0 B0 Data does
1 1 A1 B1 not matter
2 2 A2 B2
>>> columArray
['A', 'B']
>>> find = ['A1','B1']
Use .all(1).any()
for the conditional.
if (find == df1[columArray]).all(1).any():
Use .all(0)
to find rows where all columns equal the criteria; use .argmax()
to return the index of the first row equal to the criteria.
>>> (find == df1[columArray]).all(1).argmax()
1
>>>
Upvotes: 2