Cathi G
Cathi G

Reputation: 49

Looking up values in dataframe

I feel like a basic dataframe use is looking up the equivalent of a 'key' to return a 'value', but I've been searching and trying so many things for days with no success. So I think I'm not trying the right things and would appreciate any help.

I've tried .to_dict() and couldn't figure out how to shape the values into something I could look up. It also seemed inefficient to make a dictionary of a dataframe made from XML. So I'm back to trying .loc[].

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import pandas as pd

#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Append rows, create and index the dataframe
thing_rows.append({"Thing Name": "thing 1 name",
                   "Thing ID": "thing_1_id"})
thing_rows.append({"Thing Name": "thing 2 name",
                   "Thing ID": "thing_2_id"})
thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
thing_df = thing_df.set_index(list(thing_df.keys())[0])
        
print(thing_df.loc["thing 1 name"])

Output:

Thing ID    thing_1_id
Name: thing 1 name, dtype: object

Desired output:

thing_1_id

While the above focuses on just this one issue, below is a little bigger picture of what I'm trying to do in case you see a simpler or better way to get my related thing IDs out of the XML.

Ultimate desired output:

,Collection item,RELATED-THING-IDs
0,name of Item 1,"thing_1_id, thing_2_id"

Python:

# -*- coding: utf-8 -*-

# Importing the required libraries
import lxml.etree as Xet
import pandas as pd

#Define main collection columns and rows for dataframe
coll_cols = ["Collection item", "RELATED-THING-IDs"]
coll_rows = []
#Define thing lookup dataframe columns and rows
thing_cols = ["Thing Name", "Thing ID"]
thing_rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for row in root:
    # Create thing lookup dataframe
    if (row.findtext('type') == "THING"):
        thing_id = row.findtext("THING-ID")
        thing_name = row.findtext("name")
        thing_rows.append({"Thing Name": thing_name,
                           "Thing ID": thing_id})
        thing_df = pd.DataFrame(thing_rows, columns=thing_cols)
        thing_df = thing_df.set_index(list(thing_df.keys())[0])
    # Find only collection items
    if row.findtext('type') != "COLLECTION-ITEM":
        continue
    # Define values for collection item dataframe
    name = row.findtext("name", "Missing name")
    relat_thing_items = thing_df.loc[[row.xpath(
        "./RELATED-THING/result/row/name/text()")],["THING-ID"]]
    if len(relat_thing_items) > 0:
        relat_thing_id = ', '.join(relat_thing_items)
    else:
        relat_thing_id = ""

    coll_rows.append({"Collection item": name,
                 "RELATED-THING-IDs": relat_thing_id
})

coll_df = pd.DataFrame(coll_rows, columns=coll_cols)

# Writing dataframe to csv
coll_df.to_csv('output.csv')

XML:

<?xml version="1.0" encoding="UTF-8"?>

<result size="4321">
  <row>
    <type>CONTEXT</type>
    <name>collections</name>
  </row>
  <row>
    <type>COLLECTION-ITEM</type>
    <name>name of Item 1</name>
    <ITEM-ID>item_000001</ITEM-ID>
    <RELATED-THING>
      <result size="2">
        <row>
          <type>THING</type>
          <name>thing name 1</name>
          <no>1</no>
        </row>
        <row>
          <type>THING</type>
          <name>thing name 2</name>
          <no>1</no>
        </row>
      </result>
    </RELATED-THING>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 1</name>
    <THING-ID>thing_000783</THING-ID>
  </row>
  <row>
    <type>THING</type>
    <name>thing name 2</name>
    <THING-ID>thing_000803</THING-ID>
  </row>
</result>

Upvotes: 0

Views: 342

Answers (2)

Karina
Karina

Reputation: 1280

Create dataframe with customized index

A simplified version

# define columns
col = []
for i in range(10):
    col.append(f'col{i}')

# define rows
row = []
for i in range(20):
    row.append(f'row{i}')
    
# create dataframe
df = pd.DataFrame(index=row, columns=col)

# fill dataframe with datas
for i in range(len(col)):
    df[f'{col[i]}'] = np.random.rand(len(row))

# this is how it looks like
print(f'this is df with row as index\n{df}')

# define another column as index, [0] in the code line below means 'col0'
df = df.set_index(list(df.keys())[0])

# this is how it looks like
print(f'\nthis is df with col0 as index\n{df}')

Output:

this is df with row as index
           col0      col1      col2      col3      col4      col5      col6      col7      col8      col9
row0   0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
row1   0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
row2   0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
row3   0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
row4   0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
row5   0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
row6   0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
row7   0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
row8   0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
row9   0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
row10  0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
row11  0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
row12  0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
row13  0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
row14  0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
row15  0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
row16  0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
row17  0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
row18  0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
row19  0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

this is df with col0 as index
              col1      col2      col3      col4      col5      col6      col7      col8      col9
col0                                                                                              
0.767971  0.469264  0.755685  0.103732  0.615639  0.257989  0.573995  0.776728  0.447322  0.004053
0.261831  0.471375  0.795415  0.205076  0.803030  0.281501  0.876232  0.629721  0.713122  0.485636
0.509729  0.545062  0.834564  0.673355  0.729216  0.207496  0.564352  0.429800  0.277141  0.658879
0.179367  0.233000  0.653200  0.953253  0.895205  0.848281  0.597804  0.187801  0.052220  0.940323
0.390854  0.455636  0.587465  0.672617  0.996579  0.877511  0.536964  0.069080  0.396630  0.992556
0.521829  0.136424  0.506181  0.674257  0.011684  0.885795  0.378833  0.545459  0.101595  0.368110
0.316271  0.775069  0.999092  0.021254  0.369629  0.244191  0.496379  0.000377  0.593364  0.603435
0.884000  0.927754  0.254314  0.904324  0.638690  0.740589  0.149640  0.751099  0.831088  0.092892
0.992638  0.368222  0.341631  0.721588  0.410636  0.690506  0.155053  0.217033  0.945478  0.799279
0.964336  0.443232  0.563750  0.420321  0.088734  0.197364  0.490591  0.850209  0.960568  0.974885
0.183250  0.634418  0.150319  0.689473  0.835240  0.591784  0.783965  0.895664  0.025180  0.459332
0.904488  0.695560  0.556103  0.883556  0.115751  0.321207  0.904646  0.028804  0.814442  0.267377
0.930931  0.660349  0.821123  0.146736  0.539237  0.030024  0.252803  0.547877  0.463775  0.773859
0.709280  0.666561  0.245371  0.726635  0.764280  0.817297  0.504265  0.105712  0.344707  0.592827
0.450645  0.990164  0.811939  0.770661  0.640772  0.929582  0.541931  0.904480  0.195679  0.582443
0.319354  0.193135  0.902301  0.508479  0.332627  0.383111  0.282576  0.326669  0.632635  0.319938
0.831707  0.668559  0.734607  0.829461  0.906679  0.239202  0.648362  0.932863  0.882962  0.893804
0.320345  0.208685  0.571314  0.529310  0.256773  0.569330  0.330747  0.207749  0.968724  0.323970
0.809379  0.780139  0.527647  0.079957  0.762411  0.789256  0.158718  0.029256  0.812138  0.669460
0.299014  0.960852  0.758356  0.222588  0.061735  0.304292  0.893981  0.945643  0.477456  0.807782

Hope it helps you understanding more about pandas DataFrame

--- edit --- In df with row as index:

print(df['col1']['row0']) will give you 0.469264.

In df with 'col0' as index:

print(df['col1'][0.767971]) will give you 0.469264.

So basically it is df[column][row]. If your row name is string, put it as string, if it is float or int, put it as float or int.

--- edit ---

How to look up dataframe

I'll try to make it clearer with another example.

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e'],
                   'col2': ['f', 'g', 'h', 'i', 'j'],
                   'col3': ['k', 'l', 'm', 'n', 'o'],
                   'col4': ['p', 'q', 'r', 's', 't'],
                   'col5': ['u', 'v', 'w', 'x', 'y']})
print('1. df is like this :\n', df)
print('-----------------------------------')
print("2. df['col1'][0] =", df['col1'][0])
print('-----------------------------------')
print("3. df['col1'] :\n", df['col1'])
print('-----------------------------------')
print("4. list(df['col1']) =", list(df['col1']))
print('-----------------------------------')
print("5. df.loc[0] :\n", df.loc[0])
print('-----------------------------------')
print("6. df.loc[[0]] :\n", df.loc[[0]])
print('-----------------------------------')
print("7. list(df.loc[0]) = ", list(df.loc[0]))
print('-----------------------------------')
print("8. df.iloc[3, 3] =", df.iloc[3, 3])
print('-----------------------------------')
print("9. df.iloc[3, :] :\n", df.iloc[3, :])
print('-----------------------------------')
print("10. list(df.iloc[3, :]) =", list(df.iloc[3, :]))
print('-----------------------------------')
print("Bonus : *list(df.iloc[3, :]) =", *list(df.iloc[3, :]))

This is the output:

1. df is like this :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
1    b    g    l    q    v
2    c    h    m    r    w
3    d    i    n    s    x
4    e    j    o    t    y
-----------------------------------
2. df['col1'][0] = a
-----------------------------------
3. df['col1'] :
 0    a
1    b
2    c
3    d
4    e
Name: col1, dtype: object
-----------------------------------
4. list(df['col1']) = ['a', 'b', 'c', 'd', 'e']
-----------------------------------
5. df.loc[0] :
 col1    a
col2    f
col3    k
col4    p
col5    u
Name: 0, dtype: object
-----------------------------------
6. df.loc[[0]] :
   col1 col2 col3 col4 col5
0    a    f    k    p    u
-----------------------------------
7. list(df.loc[0]) =  ['a', 'f', 'k', 'p', 'u']
-----------------------------------
8. df.iloc[3, 3] = s
-----------------------------------
9. df.iloc[3, :] :
 col1    d
col2    i
col3    n
col4    s
col5    x
Name: 3, dtype: object
-----------------------------------
10. list(df.iloc[3, :]) = ['d', 'i', 'n', 's', 'x']
-----------------------------------
Bonus : *list(df.iloc[3, :]) = d i n s x

While there are certainly more ways on how to look up dataframe, the examples I show above should be enough to answer your question (correct me if I'm wrong).

Upvotes: 1

Cathi G
Cathi G

Reputation: 49

While I still want to know if there's a simpler way to do this, this get me just the single value that I'm looking for.

So, the solution is to create a dataframe and index it on key 0. That gets rid of the automatic numerical row ids (0, 1, 2, etc). Then use .loc to look up the row of the value ("thing 1 name") in the key 0 column and return the value in the specified column ("Thing ID").

print(thing_df.loc["thing 1 name","Thing ID"])

Upvotes: 0

Related Questions