user3782816
user3782816

Reputation: 171

Identifying elements in a dataframe

I have a dictionary of dataframes called names_and_places in pandas that looks like the below.

names_and_places:

Alfred,,,
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
Brett,,,
Date,F_1,F_2,Key
4/1/2020,202,404,NAN
4/2/2020,101,401,NAN
4/3/2020,102,403,"[USA,CT, Fairfield, Stamford] "
Claire,,,
Date,F_1,F_2,Key
4/1/2020,NAN,12,NAN
4/2/2020,NAN,45,NAN
4/3/2020,7,78,"[USA,CT, Fairfield, Darian] "
Dane,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, New Haven] "
Edward,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN   
4/3/2020,7,19,"[USA,CT, Bridgeport, Milford] "

(text above or image below) enter image description here

The key column is either going to be NAN or of the form [Country, State, County, City], but can be of length 3 or 4 elements (sometimes County is absent). I need to find all the elements with a given element that is contained in a key. For instance if the element = "CT", the script returns Edward, Brett, Dane and Claire (order is not important). If the element = "Stamford" then only Brett is returned. However I am going about the identification process in a way that seems very inefficient. I basically have variables that iterate through each possible combination of State, County, City (all of which I am currently manually inputting into variables) to identify which names to extract like below:

    country = 'USA' #this never needs to change
    element = 'CT'
    #These next two are actually in .txt files that I create once I am asked for 
    #a given breakdown but I would like to not have to manually input these
    middle_node = ['Fairfield','Bridgeport'] 
    terminal_nodes = ['Stamford','Darian','New Haven','Milford']
    names=[]
    for a in middle_node:
        for b in terminal_nodes:
            my_key = [country,key_of_interest,a,b]
            for s in names_and_places:
                for z in names_and_places[s]['Key']:
                    if my_key == z: 
                        names.append(s)
                #Note having "if my_key in names_and_places[s]['Key']": was causing sporadic failures for 
                #some reason 
    display(names)       

Output:

    Edward, Brett, Dane, Claire

What I would like is to be able to input only the variable element and this can either be a level 2 (State), 3 (County), or 4 (City) node. However short of adding additional for loops and going into the Key column, I don't know how to do this. The one benefit (for a novice like myself) is that the double for loops allow me to keep bucketing intact and makes it easier for people to see where names are coming from when that is also needed.

But is there a better way? For bonus points if there is a way to handle the case when the key_of_interest is 'NY' and values in the Keys column can be like [USA, NY, NY, NY] or [USA, NY, NY, Queens].

Edit: names_and_places is a dictionary with names as the index, so

    display(names_and_places['Alfred'])

would be

    Date,F_1,F_2,Key
    4/1/2020,1,4,NAN
    4/2/2020,2,5,NAN
    4/3/2020,3,6,"[USA,NY,NY, NY]"

I do have the raw dataframe that has columns:

    Date, Field name, Value, Names,

Where Field Name is either F_1, F_2 or Key and Value is the associated value of that field. I then pivot the data on Name with columns of Field Name to make my extraction easier.

Upvotes: 0

Views: 96

Answers (1)

Roy2012
Roy2012

Reputation: 12523

Here's a way to do that in a somewhat more effective way. You start by building a single dataframe out of the dictionary, and then do the actual work on that dataframe.

single_df = pd.concat([df.assign(name = k) for k, df in names_and_places.items()])
single_df["Key"] = single_df.Key.replace("NAN", np.NaN)
single_df.dropna(inplace=True)

# Since the location is a string, we have to parse it. 
location_df = pd.DataFrame(single_df.Key.str.replace(r"[\[\]]", "").str.split(",", expand=True))
location_df.columns = ["Country", "State", "County", "City"]
single_df = pd.concat([single_df, location_df], axis=1)

# this is where the actual query goes. 
single_df[(single_df.Country == "USA") & (single_df.State == "CT")].name

The output is:

2     Brett
2    Claire
2      Dane
2    Edward
Name: name, dtype: object

Upvotes: 1

Related Questions