Pandas: how to remove duplicate indexes resulting from concatenated data frames?

Question

I have some code that retrieves data from an internal database, then concatenates the data as a Pandas dataframe. The issue I'm running into as that there are duplicate indexes in the dataframe. Here is the full code:

from jira import JIRA
import pandas as pd

cert_path = 'C:\cert.crt'

start_date = '2020-10-01'
end_date = '2020-10-31'

# three different instances (each with their own schema)
a_session = JIRA(server='https://jira.myinstance-A.com', options={'verify': cert_path}, kerberos=True)

b_session = JIRA(server='https://jira.myinstance-B.com', options={'verify': cert_path}, kerberos=True)

c_session = JIRA(server='https://jira.myinstance-C.com', options={'verify': cert_path}, kerberos=True)


# define Jira queries
query_1 = 'project = \"Test Project 1\" and issuetype = Incident and resolution = Resolved and updated >= {} and updated <= {}'.format(start_date, end_date)

query_2 = 'project = \"Test Project 2\" and issuetype = Incident and resolution = Resolved and updated >= {} and updated <= {}'.format(start_date, end_date)

query_3 = 'project = \"Test Project 3\" and issuetype = Defect and resolution = Resolved and releasedate >= {} and releasedate <= {}'.format(start_date, end_date)

query_4 = 'project = \"Test Project 4\" and issuetype = Enhancement and resolution = Done and completed >= {} and completed <= {}'.format(start_date, end_date)


# fetch all issues from a given session for a given query
block_size = 100
block_num = 0


def get_all_issues(session, query):

    block_size = 50
    block_num = 0
    
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size)
        if len(issues) == 0:
            # No more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)

    issues = pd.DataFrame(issues)

    for issue in all_issues:
        d = {
            'jira_key' : issue.key,
            'issue_type' : issue.fields.type,
            'creator' : issue.fields.creator,
            'resolution' : issue.fields.resolution
             }

        fields = issue.fields  # For brevity
        if hasattr(fields, "custom_field_123"):
            d['system_change'] = fields.custom_field_123
        if hasattr(fields, "custom_field_456"):
            d['system_resources'] = fields.custom_field_456
        if hasattr(fields, "custom_field_789"):
            d['system_backup'] = fields.custom_field_789

        issues = issues.append(d, ignore_index=True)

    return issues


# list of queries, and the corresponding backend
queries = [
    (a_session, query_1),
    (a_session, query_2),
    (b_session, query_3),
    (c_session, query_4),
]


# loop over each pair of session and query, calling the get_all_issues function, and save the dataframe we get each time
dataframes = []

for session, query in queries:
    dataframe = get_all_issues(session, query)
    dataframes.append(dataframe)


# concatenate all data frames
df_concat = pd.concat(dataframes)

# derive the business units from the project codes
df_concat['business_unit'] = np.where(df_concat['jira_key'].str.contains('MER'), 'Mercedes',
                             np.where(df_concat['jira_key'].str.contains('HON'), 'Honda',
                             np.where(df_concat['jira_key'].str.contains('AST'), 'Aston Martin', '*ERROR*')))

Here is what some sample data would look like (notice the duplicate indexes):

   jira key   issue_type    creator   resolution   system_change   system_resources   system_backup   business_unit
0  MER-361    Incident      Smith, J  Resolved     Saturn          High               NaN             Mercedes
1  MER-362    Enhancement   Jones, T  In Progress  NaN             Medium             Not Applicable  Mercedes
2  MER-363    Incident      Ng, V     Resolved     Saturn          NaN                Not Applicable  Mercedes
3  MER-364    Incident      Jones, T  Resolved     NaN             NaN                Not Applicable  Mercedes
0  AST-022    Incident      Smith, J  Resolved     Saturn          High               NaN             Astin Martin
1  AST-023    Incident      Smith, J  Resolved     Saturn          High               NaN             Astin Martin
2  AST-024    Incident      Jones, T  Resolved     Saturn          High               NaN             Astin Martin
0  HON-124    Incident      Smith, J  In Progress  NaN             Low                NaN             Honda
1  HON-125    Incident      Smith, J  Resolved     Saturn          High               NaN             Honda
2  HIN-126    Incident      Jones, T  In Progress  Saturn          Low                NaN             Honda

What is the best way to "arrange" the indexes such that they are sequential? (as follows)

   jira key   issue_type    creator   resolution   system_change   system_resources   system_backup   business_unit
0  MER-361    Incident      Smith, J  Resolved     Saturn          High               NaN             Mercedes
1  MER-362    Enhancement   Jones, T  In Progress  NaN             Medium             Not Applicable  Mercedes
2  MER-363    Incident      Ng, V     Resolved     Saturn          NaN                Not Applicable  Mercedes
3  MER-364    Incident      Jones, T  Resolved     NaN             NaN                Not Applicable  Mercedes
4  AST-022    Incident      Smith, J  Resolved     Saturn          High               NaN             Astin Martin
5  AST-023    Incident      Smith, J  Resolved     Saturn          High               NaN             Astin Martin
6  AST-024    Incident      Jones, T  Resolved     Saturn          High               NaN             Astin Martin
7  HON-124    Incident      Smith, J  In Progress  NaN             Low                NaN             Honda
8  HON-125    Incident      Smith, J  Resolved     Saturn          High               NaN             Honda
9  HIN-126    Incident      Jones, T  In Progress  Saturn          Low                NaN             Honda

Thanks in advance!

noah · Accepted Answer

Just add the ignore_index (Docs) parameter:

df_concat = pd.concat(dataframes, ignore_index=True)

After the concat has occurred you can also use reset_index() (Docs) which will have the same effect.

Pandas: how to remove duplicate indexes resulting from concatenated data frames?

Answers (1)

Related Questions