BingbongKim

Reputation: 579

Python - How to improve the dataframe performance?

There are 2 CSV files. Each file has 700,000 rows.

I need to read one file line by line and find the matching row in the other file.

Then I combine the data from both files into a single file.

But it takes about 1 minute per 1,000 rows!

I don't know how to improve the performance.

Here is my code:

import pandas as pd

fail_count = 0 
match_count = 0 
count = 0 

file1_df = pd.read_csv("Data1.csv", sep='\t')
file2_df = pd.read_csv("Data2.csv", sep='\t')

columns = ['Name', 'Age', 'Value_file1', 'Value_file2']
result_df = pd.DataFrame(columns=columns)

for row in file1_df.iterrows():

    name = row[1][2]
    age = row[1][3]

    selected = file2_df[(file2_df['Name'] == name) & (file2_df['Age'] == age)]

    if selected.empty:
        fail_count += 1
        continue

    value_file1 = row[1][4]
    value_file2 = selected['Value'].values[0]

    result_df.loc[len(result_df)] = [name, age, value_file1, value_file2]

    match_count += 1

print('match : ' + str(match_count))
print('fail : ' + str(fail_count))
result_df.to_csv('result.csv', index=False, encoding='utf-8')

Which line can be changed?

Is there any other way to do this process?

Upvotes: 0

Views: 124

Answers (1)

Adrian Mc

Reputation: 148

This might be too simplistic, but have you tried using pandas.merge() functionality?

See here for syntax.

For your tables:

result_df = pd.merge(left=file1_df, right=file2_df, on=['Name', 'Age'], how='inner')

That will do an "inner" join, only keeping rows with Names & Ages that match in both tables.
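A fuller sketch along those lines, assuming both files carry a `Value` column as in your loop (the small inline frames stand in for your `read_csv(..., sep='\t')` calls, and the column names are taken from your code):

```python
import pandas as pd

# Stand-in frames; in your case these come from pd.read_csv(..., sep='\t').
file1_df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                         'Age': [10, 20, 30],
                         'Value': [1, 2, 3]})
file2_df = pd.DataFrame({'Name': ['A', 'C'],
                         'Age': [10, 30],
                         'Value': [100, 300]})

# Inner join on both key columns; suffixes rename the overlapping
# 'Value' columns to Value_file1 / Value_file2, matching your result layout.
result_df = pd.merge(file1_df, file2_df,
                     on=['Name', 'Age'],
                     how='inner',
                     suffixes=('_file1', '_file2'))

match_count = len(result_df)
# Correct as long as Name/Age pairs are unique in file2_df; duplicated
# keys there would inflate match_count.
fail_count = len(file1_df) - match_count

print('match : ' + str(match_count))
print('fail : ' + str(fail_count))
result_df.to_csv('result.csv', index=False, encoding='utf-8')
```

The merge is a single vectorized operation, so it should handle 700,000 rows in seconds rather than looping row by row.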

Upvotes: 2
