Reputation: 228

Why is PowerBI two orders of magnitude slower than Python to join two files?

I have two csv files:

File A: 2.7m rows, 5 columns, 350MB
File B: 4.5m rows, 2 columns, 185MB

I'm trying to merge (join) them on a single column. On my computer, it takes 20-25 minutes to load data from the CSVs and join them in PowerBI. If I perform the same operation in python using Pandas dataframes, it takes 15-20 seconds. (Queries & code at end of question)

I would expect that python would be somewhat faster than PowerBI, but not by two orders of magnitude. These files aren't tiny, but they're not exceptionally large either (by data analysis standards). I would expect PowerBI to handle a common data manipulation task like this with relatively good performance using default settings.

Am I doing something wrong? Is there a way to speed up these merges in PowerBI? Or is PowerBI totally unsuited for a task like this?

PowerBI Queries

I have three PowerBI queries to accomplish this task. The first two were auto-generated by clicking New Source -> Text/CSV. They're load-disabled and only exist to be merged by the third.

FileA:

let
 Source = Csv.Document(File.Contents("path/to/a.csv"),[Delimiter=",", Columns=5, Encoding=1252, QuoteStyle=QuoteStyle.None]),
 #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
 #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type datetime}, {"ColB", type text}, {"ColC", type text}, {"ColD", type text}, {"ColE", type text}})
in
 #"Changed Type"

FileB:

let
 Source = Csv.Document(File.Contents("path/to/b.csv"),[Delimiter=",", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
 #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
 #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type text}, {"ColB", type datetime}})
in
 #"Changed Type"

Merge:

let
 Source = Table.NestedJoin(FileA, {"ColC"}, FileB, {"ColA"}, "FileB", JoinKind.LeftOuter),
 #"Expanded FileB" = Table.ExpandTableColumn(Source, "FileB", {"ColA", "ColB"}, {"FileB.ColA", "FileB.ColB"})
in
 #"Expanded FileB"

Using PowerBI Desktop. Version: 2.128.952.0 64-bit (April 2024)

Python Code

import pandas as pd
 
# Load dataframes
with open("path/to/a.csv", "r") as file:
 df_a = pd.read_csv(file)
with open("path/to/b.csv", "r") as file:
 df_b = pd.read_csv(file)
 
# Merge dataframes
df = pd.merge(
 df_a,
 df_b,
 how='left',
 left_on="ColC",
 right_on="ColA"
)

Using Python 3.10.9 & Pandas 1.5.3

Upvotes: 1

Why is PowerBI two orders of magnitude slower than Python to join two files?

Answers (2)

Related Questions