Ignacio Riveros Godoy

Reputation: 25

How can I iterate over values of a dictionary in an efficient way?

I'm running a school assignment matching algorithm using dictionaries. The algorithm itself is relatively efficient, except for the part where I need to export the results to a .csv.

students is a dictionary with 483,070 key-value pairs. The key is an integer id, and the value is an object of a Student class that I wrote. Currently, I export the results with the following functions.

def parse_student_match_information(student: Student) -> int:
    if student.assigned_vacancy is None:
        return 0
    return student.assigned_vacancy.program_id

def get_assignation_output(students: dict)-> pd.DataFrame:
    result = pd.DataFrame(columns = ['Student_ID', 'Program_ID', 'Grade_ID'])
    for student in students.values():
        program_id = parse_student_match_information(student)
        result = result.append({'Student_ID': student.id, 'Program_ID': program_id, 'Grade_ID': student.grade}, ignore_index = True)
    return result.sort_values('Grade_ID')

It took more than an hour to produce this pd.DataFrame. Any suggestions are welcome!

Upvotes: 0

Views: 47

Answers (1)

gold_cy

Reputation: 14216

Generally you don't want to append to a DataFrame row by row; instead, create it from an iterable of rows. A better way is shown below.

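from typing import Iterable

import pandas as pd

def parse_student_match_information(student: Student) -> int:
    # Unassigned students are exported with program id 0.
    if student.assigned_vacancy is None:
        return 0
    return student.assigned_vacancy.program_id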

def get_assignation_output(students: dict) -> Iterable[dict]:
    for student in students.values():
        program_id = parse_student_match_information(student)
        result = {'Student_ID': student.id, 'Program_ID': program_id, 'Grade_ID': student.grade}
        yield result

def make_df(rows: Iterable[dict]) -> pd.DataFrame:
    df = pd.DataFrame(rows, columns=['Student_ID', 'Program_ID', 'Grade_ID'])
    # sort_values returns a new DataFrame, so return its result.
    return df.sort_values(by=['Grade_ID'])

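This way you create the DataFrame from all the rows at once instead of growing it row by row. Each call to append copies the entire DataFrame, so the original loop does work that grows quadratically with the number of students, whereas building from an iterable is a single pass followed by one sort at the end. You should see a substantial performance improvement from this.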
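For completeness, here is a minimal end-to-end sketch of the same idea, including the export to CSV. The Student and Vacancy dataclasses are hypothetical stand-ins for your own classes (only the attributes used in the question are modelled), build_output is an illustrative name, and the CSV is written to an in-memory buffer rather than a real file.

```python
import io
from dataclasses import dataclass
from typing import Dict, Optional

import pandas as pd


# Hypothetical stand-ins for the asker's classes; only the attributes
# used in the question (id, grade, assigned_vacancy.program_id) exist here.
@dataclass
class Vacancy:
    program_id: int


@dataclass
class Student:
    id: int
    grade: int
    assigned_vacancy: Optional[Vacancy] = None


def build_output(students: Dict[int, Student]) -> pd.DataFrame:
    # Collect plain dicts first; the DataFrame is then built in one call.
    rows = [
        {
            'Student_ID': s.id,
            'Program_ID': (s.assigned_vacancy.program_id
                           if s.assigned_vacancy is not None else 0),
            'Grade_ID': s.grade,
        }
        for s in students.values()
    ]
    df = pd.DataFrame(rows, columns=['Student_ID', 'Program_ID', 'Grade_ID'])
    # sort_values returns a new DataFrame, so the result has to be kept.
    return df.sort_values('Grade_ID')


students = {
    1: Student(id=1, grade=3, assigned_vacancy=Vacancy(program_id=10)),
    2: Student(id=2, grade=1),
    3: Student(id=3, grade=2, assigned_vacancy=Vacancy(program_id=20)),
}

df = build_output(students)

# Write to an in-memory buffer; pass a file path instead to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

With a real dictionary you would pass a path such as 'results.csv' to to_csv instead of the buffer; index=False keeps the pandas row index out of the file.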

Upvotes: 1
