Reputation: 2573
I have a csv file that contains repeated items. The reason that the items are repeated is that each row is indicating a specific loan that has been paid during time. For example, loan with ID=156 has been paid in 5 payments. So the loan information is repeated in each row except the payment amount and payment date is different. I want to make a bar chart of the paid loans but counting the paid loans by counting the rows that their status is paid is not working as it counts repeated loans. I am wondering how I should fix this problem.
Here is my code :
loans[loans["status"]=="paid"] ["location.country"].value_counts().plot(kind="bar", fontsize=7)
This code is trying to plot the countries that paid their loans but as I said it is not correct as loans["status"]=="paid"
could be TRUE for a loan that is repeated in multiple rows, therefore, loans[loans["status"]=="paid"] ["location.country"].value_counts()
is the wrong count.
Here is an example of how data looks:
ID status paymentAmount paymentProcessDate Country
156 paid $400 2/12/2009 CountryA
156 paid $123 2/11/2010 CountryA
156 paid $2000 2/09/2011 CountryA
134 unpaid Nan 2/05/2013 CountryB
134 unpaid Nan 2/07/2013 CountryB
Upvotes: 1
Views: 96
Reputation: 11602
Here is one way to filter out ID
s that have already been paid.
import pandas as pd
from io import StringIO
df = pd.DataFrame.from_csv(StringIO('''ID status paymentAmount paymentProcessDate Country
156 paid $400 2/12/2009 CountryA
156 paid $123 2/11/2010 CountryA
156 paid $2000 2/09/2011 CountryA
134 unpaid Nan 2/05/2013 CountryB
134 unpaid Nan 2/07/2013 CountryB'''), sep=r' +')
paid_ids = df[df.status=='paid'].ID.drop_duplicates()
unpaid = df[~df.ID.isin(paid_ids)]
unpaid.Country.value_counts().plot(kind='bar')
Upvotes: 2