Reputation: 228
I have two csv files:
I'm trying to merge (join) them on a single column. On my computer, it takes 20-25 minutes to load data from the CSVs and join them in PowerBI. If I perform the same operation in python using Pandas dataframes, it takes 15-20 seconds. (Queries & code at end of question)
I would expect that python would be somewhat faster than PowerBI, but not by two orders of magnitude. These files aren't tiny, but they're not exceptionally large either (by data analysis standards). I would expect PowerBI to handle a common data manipulation task like this with relatively good performance using default settings.
Am I doing something wrong? Is there a way to speed up these merges in PowerBI? Or is PowerBI totally unsuited for a task like this?
PowerBI Queries
I have three PowerBI queries to accomplish this task. The first two were auto-generated by clicking New Source
-> Text/CSV
. They're load-disabled and only exist to be merged by the third.
FileA:
let
Source = Csv.Document(File.Contents("path/to/a.csv"),[Delimiter=",", Columns=5, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type datetime}, {"ColB", type text}, {"ColC", type text}, {"ColD", type text}, {"ColE", type text}})
in
#"Changed Type"
FileB:
let
Source = Csv.Document(File.Contents("path/to/b.csv"),[Delimiter=",", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ColA", type text}, {"ColB", type datetime}})
in
#"Changed Type"
Merge:
let
Source = Table.NestedJoin(FileA, {"ColC"}, FileB, {"ColA"}, "FileB", JoinKind.LeftOuter),
#"Expanded FileB" = Table.ExpandTableColumn(Source, "FileB", {"ColA", "ColB"}, {"FileB.ColA", "FileB.ColB"})
in
#"Expanded FileB"
Using PowerBI Desktop. Version: 2.128.952.0 64-bit (April 2024)
Python Code
import pandas as pd
# Load dataframes
with open("path/to/a.csv", "r") as file:
df_a = pd.read_csv(file)
with open("path/to/b.csv", "r") as file:
df_b = pd.read_csv(file)
# Merge dataframes
df = pd.merge(
df_a,
df_b,
how='left',
left_on="ColC",
right_on="ColA"
)
Using Python 3.10.9 & Pandas 1.5.3
Upvotes: 1
Views: 165
Reputation: 191
You can use Table.Join() instead of Table.NestedJoin() with Table.ExpandTableColumn(). Table.ExpandTableColumn() is very slow.
Merge step will be somthing like this:
let
Source = Table.Join(FileA, {"ColC"}, FileB, {"ColA"}, JoinKind.LeftOuter)
in
Source
But you should check that columns have different names in both tables before Table.Join(). You can add prefixes to column names using Table.PrefixColumns() to guarantee this.
Upvotes: 1
Reputation: 30174
Does buffering into memory help? e.g.
let
a = Table.Buffer(FileA),
b = Table.Buffer(FileB),
Source = Table.NestedJoin(a, {"ColC"}, b, {"ColA"}, "FileB", JoinKind.LeftOuter),
#"Expanded FileB" = Table.ExpandTableColumn(Source, "FileB", {"ColA", "ColB"}, {"FileB.ColA", "FileB.ColB"})
in
#"Expanded FileB"
Upvotes: 1