Reputation: 111
I have a dataset of US House of Representatives races. While there are 435 districts, I have 439 Democratic candidates, which is too many, and I'm trying to figure out why. I suspect there are runoff races causing this, which I want to test.
>>> democrat_results.head()
states po dist cand party cand_votes tot_votes
0 ALABAMA AL 1 ROBERT KENNEDY JR. DEMOCRAT 89226 242617
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
7 ALABAMA AL 3 MALLORY HAGAN DEMOCRAT 83996 231915
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
13 ALABAMA AL 5 PETER JOFFRION DEMOCRAT 101388 260673
What I'm trying to do is see if any of the state districs (eg. AL 1, AL 2) have two listings. I can figure out how to do this on my own, but the problem I'm having is that whenever I write a for loop to act on the dataframe, it seems to just act on the column headers.
unique_races = []
for row in democrat_results[1:]:
if row not in unique_races:
unique_races.append(row)
# this was "row" and "race", now has been changed to just be row in both cases
unique_races
returns:
['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes']
(I am aware that the loop won't do what I'm looking for, it's just to demonstrate what happens)
How do I avoid this and instead have the for loop act on the rows?
I am aware that for loops are inefficient, but I'm only using a few hundred values and do not know more advanced methods, making it suitable enough for me.
Rest assured that I have spent time looking for an answer, and not found one, leading to me asking this question.
Upvotes: 0
Views: 853
Reputation: 15442
your suggested groupby ("po", "dist") suggestion is on the right track. The issue is comparing the actual items you want from each row. When you loop through the DataFrame, you're getting the entire row. You can see this by inspecting the first item returned by df.iterrows()
:
In [22]: next(df.iterrows())
Out[22]:
(0,
states ALABAMA
po AL
dist 1
cand ROBERT KENNEDY JR.
party DEMOCRAT
cand_votes 89226
tot_votes 242617
Name: 0, dtype: object)
Along with the index, 0
, we have the full row. Since each index and row value is unique in this dataframe, this will never have any match elsewhere in the dataframe. Therefore, your test, if row not in unique_races
will never return False
. Instead, you want to check for the values of po
and dist
:
for ix, row in df.iterrows():
race = (row['po'], row['dist']) # this produces a tuple, e.g. ('AL', 1)
if race not in unique_races:
unique_races.append(row)
This will give you a list with 1 row from each unique race. However, it seems like you're trying to identify the races with two or more candidates. Additionally, as you suggested, this isn't the fastest way to do this.
The compound index ('po', 'dist') does make this tricky. To simplify things, we can add a unique dist_id column:
# convert dist to str so it can be added to the string PO code
df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))
Now, we can count the occurrences of unique dist_ids, and find dist_ids with more than one candidate:
counts_by_distid = df_with_distid.groupby('dist_id').size()
more_than_1 = counts_by_distid[counts_by_distid > 1]
Finally, we can subset the full frame to those races with more than one Democratic candidate:
df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]
To test this out, I modified your subset of the data to include an extra entry:
In [32]: df = pd.DataFrame(
...: columns=['states', 'po', 'dist', 'cand', 'party', 'cand_votes', 'tot_votes'],
...: data=[
...: ['ALABAMA', 'AL', '1', 'ROBERT KENNEDY JR.', 'DEMOCRAT', '89226', '242617'],
...: ['ALABAMA', 'AL', '2', 'TABITHA ISNER', 'DEMOCRAT', '86931', '226230'],
...: ['ALABAMA', 'AL', '3', 'MALLORY HAGAN', 'DEMOCRAT', '83996', '231915'],
...: ['ALABAMA', 'AL', '4', 'LEE AUMAN', 'DEMOCRAT', '46492', '230969'],
...: ['ALABAMA', 'AL', '5', 'PETER JOFFRION', 'DEMOCRAT', '101388', '260673'],
...: ['ALABAMA', 'AL', '5', 'JANE DOE', 'DEMOCRAT', '159285', '260673'],
...: ])
The above code results in the following:
In [33]: df_with_distid = df.assign(dist_id=(df['po'] + df['dist'].astype(str)))
In [34]: counts_by_distid = df_with_distid.groupby('dist_id').size()
...: more_than_1 = counts_by_distid[counts_by_distid > 1]
In [35]: df_with_distid[df_with_distid.dist_id.isin(more_than_1.index)]
Out[35]:
states po dist cand party cand_votes tot_votes dist_id
4 ALABAMA AL 5 PETER JOFFRION DEMOCRAT 101388 260673 AL5
5 ALABAMA AL 5 JANE DOE DEMOCRAT 159285 260673 AL5
Upvotes: 1
Reputation: 45
code:
class Data:
results = [
"states po dist cand party cand_votes tot_votes",
"0 ALABAMA AL 1 ROBERT KENNEDY JR. DEMOCRAT 89226 242617",
"4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230",
"4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230",
"4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230",
"7 ALABAMA AL 3 MALLORY HAGAN DEMOCRAT 83996 231915",
"10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969",
"13 ALABAMA AL 5 PETER JOFFRION DEMOCRAT 101388 260673",
"10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969",
]
done = []
def search_data(po, dist):
results = []
for line in Data.results[1:]:
line = str(line)
line = line.split(' ')
line = ' '.join(x for x in line if x != '')
line = line.split(' ')
if po == line[2]:
if dist == line[3]:
line = ' '.join(x for x in line)
results.append(line)
try:
d = results[1]
if d not in done:
done.append(d)
print('\n==============================================')
for data in results:
print(data)
print('==============================================')
except:
# print('result has only 1 result in it. No dupes.')
pass
for line in Data.results[1:].copy():
line = str(line)
line = line.split(' ')
line = ' '.join(x for x in line if x != '')
line = line.split(' ')
po = line[2]
dist = line[3]
search_data(po, dist)
output:
==============================================
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
4 ALABAMA AL 2 TABITHA ISNER DEMOCRAT 86931 226230
==============================================
==============================================
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
10 ALABAMA AL 4 LEE AUMAN DEMOCRAT 46492 230969
==============================================
This code above will loop through the dataset and index the PO and DIST of all lines and then print out in a group all the dupes. If no dupes nothing will be printed out for that particuler PO and DIST.
So basically running the code on your dataset should print out all the dupes that you want in a group with ================= and \n between to make veiwing easier.
:)
Upvotes: 1