Demosthenes

Reputation: 111

Trying to iterate over rows of dataframe but only accessing column headers

I have a dataset of US House of Representatives races. While there are 435 districts, I have 439 Democratic candidates, which is too many, and I'm trying to figure out why. I suspect there are runoff races causing this, which I want to test.

>>> democrat_results.head()
    states  po  dist    cand                party       cand_votes  tot_votes
0   ALABAMA AL  1       ROBERT KENNEDY JR.  DEMOCRAT    89226       242617
4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230
7   ALABAMA AL  3       MALLORY HAGAN       DEMOCRAT    83996       231915
10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969
13  ALABAMA AL  5       PETER JOFFRION      DEMOCRAT    101388      260673

What I'm trying to do is see if any of the state districts (e.g. AL 1, AL 2) have two listings. I can figure out how to do that on my own, but the problem is that whenever I write a for loop over the dataframe, it seems to act only on the column headers.

unique_races = []

for row in democrat_results[1:]:
    if row not in unique_races:
        unique_races.append(row)
# this was "row" and "race", now has been changed to just be row in both cases

unique_races returns:

['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes']

(I am aware that the loop won't do what I'm looking for, it's just to demonstrate what happens)

How do I avoid this and instead have the for loop act on the rows?
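A minimal reproduction with a toy frame (not my real data) shows the same thing: iterating a DataFrame directly walks the column labels, not the rows.

```python
import pandas as pd

toy = pd.DataFrame({'po': ['AL', 'AL'], 'dist': [1, 2]})

seen = []
for row in toy[1:]:       # same pattern as above; [1:] slices rows, but
    if row not in seen:   # iterating a DataFrame yields its column labels
        seen.append(row)

print(seen)  # ['po', 'dist'] -- the column headers, not the rows
```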

I am aware that for loops are inefficient, but I'm only working with a few hundred values and don't know more advanced methods, so a loop is good enough for my purposes.

Rest assured that I have spent time looking for an answer and haven't found one, which is why I'm asking this question.

Upvotes: 0

Views: 853

Answers (2)

Michael Delgado

Reputation: 15442

Your groupby("po", "dist") suggestion is on the right track. The issue is comparing the actual items you want from each row: when you loop with df.iterrows(), you get the entire row. You can see this by inspecting the first item it returns:

In [22]: next(df.iterrows())
Out[22]:
(0,
 states                   ALABAMA
 po                            AL
 dist                           1
 cand          ROBERT KENNEDY JR.
 party                   DEMOCRAT
 cand_votes                 89226
 tot_votes                 242617
 Name: 0, dtype: object)

Along with the index, 0, we have the full row. Since each (index, row) pair is unique in this dataframe, a whole row will never match anything already in your list, so your test, if row not in unique_races, will never return False. Instead, you want to check the values of po and dist:

unique_races = []
unique_rows = []

for ix, row in df.iterrows():
    race = (row['po'], row['dist'])  # this produces a tuple, e.g. ('AL', 1)

    if race not in unique_races:
        unique_races.append(race)  # track the (po, dist) pairs already seen
        unique_rows.append(row)    # keep the first row for each race

This will give you a list with one row from each unique race. However, it seems like you're actually trying to identify the races with two or more candidates. Additionally, as you suggested, iterating isn't the fastest way to do this.
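As a faster, loop-free alternative (a sketch on a toy frame), pandas' built-in duplicated can flag every member of a repeated ('po', 'dist') pair directly:

```python
import pandas as pd

df = pd.DataFrame({
    'po':   ['AL', 'AL', 'AL'],
    'dist': [1, 5, 5],
    'cand': ['ROBERT KENNEDY JR.', 'PETER JOFFRION', 'JANE DOE'],
})

# keep=False marks every row in a duplicated group, not just the repeats
dupes = df[df.duplicated(subset=['po', 'dist'], keep=False)]
print(dupes)  # both AL-5 rows
```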

The compound index ('po', 'dist') does make this tricky. To simplify things, we can add a unique dist_id column:

# convert dist to str so it can be added to the string PO code
df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))

Now, we can count the occurrences of unique dist_ids, and find dist_ids with more than one candidate:

counts_by_distid = df_with_distid.groupby('dist_id').size()
more_than_1 = counts_by_distid[counts_by_distid > 1]

Finally, we can subset the full frame to those races with more than one Democratic candidate:

df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]

To test this out, I modified your subset of the data to include an extra entry:

In [32]: df = pd.DataFrame(
    ...:     columns=['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes'],
    ...:     data=[
    ...:         ['ALABAMA', 'AL', '1', 'ROBERT KENNEDY JR.', 'DEMOCRAT', '89226', '242617'],
    ...:         ['ALABAMA', 'AL', '2', 'TABITHA ISNER', 'DEMOCRAT', '86931', '226230'],
    ...:         ['ALABAMA', 'AL', '3', 'MALLORY HAGAN', 'DEMOCRAT', '83996', '231915'],
    ...:         ['ALABAMA', 'AL', '4', 'LEE AUMAN', 'DEMOCRAT', '46492', '230969'],
    ...:         ['ALABAMA', 'AL', '5', 'PETER JOFFRION', 'DEMOCRAT', '101388', '260673'],
    ...:         ['ALABAMA', 'AL', '5', 'JANE DOE', 'DEMOCRAT', '159285', '260673'],
    ...: ])

The above code results in the following:

In [33]: df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))

In [34]: counts_by_distid = df_with_distid.groupby('dist_id').size()
    ...: more_than_1 = counts_by_distid[counts_by_distid > 1]

In [35]: df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]
Out[35]:
    states  po dist            cand     party cand_votes tot_votes dist_id
4  ALABAMA  AL    5  PETER JOFFRION  DEMOCRAT     101388    260673     AL5
5  ALABAMA  AL    5        JANE DOE  DEMOCRAT     159285    260673     AL5
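For reference, a sketch of a variant that skips the helper column entirely: groupby accepts a list of keys, so the same subset falls out of filter (same toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'po':   ['AL'] * 3,
    'dist': ['1', '5', '5'],
    'cand': ['ROBERT KENNEDY JR.', 'PETER JOFFRION', 'JANE DOE'],
})

# keep only the (po, dist) groups that contain more than one candidate
contested = df.groupby(['po', 'dist']).filter(lambda g: len(g) > 1)
print(contested)  # the two AL-5 rows
```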

Upvotes: 1

n c

Reputation: 45

code:

class Data:
    results = [
        "states  po  dist    cand                party       cand_votes  tot_votes",
        "0   ALABAMA AL  1       ROBERT KENNEDY JR.  DEMOCRAT    89226       242617",
        "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
        "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
        "4   ALABAMA AL  2       TABITHA ISNER       DEMOCRAT    86931       226230",
        "7   ALABAMA AL  3       MALLORY HAGAN       DEMOCRAT    83996       231915",
        "10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969",
        "13  ALABAMA AL  5       PETER JOFFRION      DEMOCRAT    101388      260673",
        "10  ALABAMA AL  4       LEE AUMAN           DEMOCRAT    46492       230969",
    ]

done = []

def search_data(po, dist):
    results = []
    for line in Data.results[1:]:
        fields = line.split()  # split on runs of whitespace
        if po == fields[2] and dist == fields[3]:
            results.append(' '.join(fields))

    # only report a (po, dist) pair once, and only if it has duplicates
    if len(results) > 1 and results[1] not in done:
        done.append(results[1])
        print('\n==============================================')
        for data in results:
            print(data)
        print('==============================================')


for line in Data.results[1:]:
    fields = line.split()
    search_data(fields[2], fields[3])

output:

==============================================
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
==============================================

==============================================
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
==============================================

The code above loops through the dataset, indexes the PO and DIST of every line, and then prints each group of dupes together. If a particular PO and DIST has no dupes, nothing is printed for it.

So basically, running the code on your dataset should print all the dupes you want, grouped between ================= lines with a blank line in between to make viewing easier.
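A shorter plain-Python sketch of the same idea (assuming the same kind of line list) counts each (po, dist) pair with collections.Counter and reports the ones that appear more than once:

```python
from collections import Counter

results = [
    "states po dist cand party cand_votes tot_votes",
    "0 ALABAMA AL 1 ROBERT KENNEDY JR. DEMOCRAT 89226 242617",
    "4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230",
    "4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230",
]

# count each (po, dist) pair; split() collapses runs of whitespace
counts = Counter(tuple(line.split()[2:4]) for line in results[1:])

for (po, dist), n in counts.items():
    if n > 1:
        print(po, dist, 'appears', n, 'times')
```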

:)

Upvotes: 1
