Dacromir
Dacromir

Reputation: 228

Why is PowerBI two orders of magnitude slower than Python to join two files?

I have two csv files:

I'm trying to merge (join) them on a single column. On my computer, it takes 20-25 minutes to load data from the CSVs and join them in PowerBI. If I perform the same operation in python using Pandas dataframes, it takes 15-20 seconds. (Queries & code at end of question)

I would expect that python would be somewhat faster than PowerBI, but not by two orders of magnitude. These files aren't tiny, but they're not exceptionally large either (by data analysis standards). I would expect PowerBI to handle a common data manipulation task like this with relatively good performance using default settings.

Am I doing something wrong? Is there a way to speed up these merges in PowerBI? Or is PowerBI totally unsuited for a task like this?

PowerBI Queries

I have three PowerBI queries to accomplish this task. The first two were auto-generated by clicking New Source -> Text/CSV. They're load-disabled and only exist to be merged by the third.

FileA:

let
 Source = Csv.Document(File.Contents("path/to/a.csv"),[Delimiter=",", Columns=5, Encoding=1252, QuoteStyle=QuoteStyle.None]),
 #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
 #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type datetime}, {"ColB", type text}, {"ColC", type text}, {"ColD", type text}, {"ColE", type text}})
in
 #"Changed Type"

FileB:

let
 Source = Csv.Document(File.Contents("path/to/b.csv"),[Delimiter=",", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
 #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
 #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type text}, {"ColB", type datetime}})
in
 #"Changed Type"

Merge:

let
 Source = Table.NestedJoin(FileA, {"ColC"}, FileB, {"ColA"}, "FileB", JoinKind.LeftOuter),
 #"Expanded FileB" = Table.ExpandTableColumn(Source, "FileB", {"ColA", "ColB"}, {"FileB.ColA", "FileB.ColB"})
in
 #"Expanded FileB"

Using PowerBI Desktop. Version: 2.128.952.0 64-bit (April 2024)

Python Code

import pandas as pd
 
# Load dataframes
with open("path/to/a.csv", "r") as file:
 df_a = pd.read_csv(file)
with open("path/to/b.csv", "r") as file:
 df_b = pd.read_csv(file)
 
# Merge dataframes
df = pd.merge(
 df_a,
 df_b,
 how='left',
 left_on="ColC",
 right_on="ColA"
)

Using Python 3.10.9 & Pandas 1.5.3

Upvotes: 1

Views: 165

Answers (2)

Evgeny Avdeev
Evgeny Avdeev

Reputation: 191

You can use Table.Join() instead of Table.NestedJoin() with Table.ExpandTableColumn(). Table.ExpandTableColumn() is very slow.

Merge step will be somthing like this:

let
 Source = Table.Join(FileA, {"ColC"}, FileB, {"ColA"}, JoinKind.LeftOuter)
in
 Source 

But you should check that columns have different names in both tables before Table.Join(). You can add prefixes to column names using Table.PrefixColumns() to guarantee this.

Upvotes: 1

davidebacci
davidebacci

Reputation: 30174

Does buffering into memory help? e.g.

let
 a = Table.Buffer(FileA),
 b = Table.Buffer(FileB),
 Source = Table.NestedJoin(a, {"ColC"}, b, {"ColA"}, "FileB", JoinKind.LeftOuter),
 #"Expanded FileB" = Table.ExpandTableColumn(Source, "FileB", {"ColA", "ColB"}, {"FileB.ColA", "FileB.ColB"})
in
 #"Expanded FileB"

Upvotes: 1

Related Questions