Charafeddine2
Charafeddine2

Reputation: 11

Join multiple csv files into a single table using SSIS or SQL

SSIS beginner here, working on project that follows coding languages performance in github throughout the last decade. I have 3 datasets :

  1. Repos
  2. Pull Requests
  3. Issues.

Issues dataset is as below:

Example of issues dataset

and here is an example of Repos dataset

Repos dataset.

First issue of mine, I don't need the quarter. Second, and what brought me here, how to merge these flat files to a single table and sum data to be

Language | Nbr of repos | Nbr of Pull requests | Year

Help would be much appreciated, I've tried many ways to figure it out but haven't been able to achieve what I'm looking for.

Upvotes: 0

Views: 592

Answers (1)

billinkc
billinkc

Reputation: 61259

1 - If you don't need data, don't use it. Preferably, don't bring it into the dataflow pipeline but just because it's there, it doesn't mean you need to use it.

I would advocate landing each file/tab/dataset into a matching table in your database. It allows you to ensure you've landed exactly the same data that was provided. Once that's done, write a custom query to join the assorted tables together. Something like

SELECT
    R.language
,   R.num_repos AS [Nbr of repos]
,   SUM(PR.[count]) AS [Nbr of Pull requests]
,   PR.year
FROM
    dbo.repos AS R
    -- Assuming inner is accurate
    INNER JOIN
        dbo.PullRequests AS PR
        ON PR.name = R.language
GROUP BY
    R.language
,   R.num_repos
,   PR.year;

The SUM and GROUP BY reflects the fact that you likely want to get the total counts per year and since the data appears to be reported at a quarterly interval, that should wrap that up. Or, it could be that this is semi-additive data you need to get the last available quarter for a given name and count (Language monkeyBrainz got renamed to LaconicApe in Q3 2021 so the last entry for MB is in Q3 2021 and LA starts Q4, 2021)

You can then use that custom query as a source in SSIS to generate your extract (to CSV, to Excel, to Apooche Flerken - it'll be huge in 2033)

SSIS route

Bringing data together has a few options in SSIS: Merge Join, Lookup, Script Task.

Merge Join requires data to be sorted based on the key(s). In this case, Name/Language needs to have a Sort transformation from the source component (flat file/excel?). That gets fed into the Merge Join.

If you have all 4 quarters in the pull requests and you want to total stats for the year, you'll likely want to have the Aggregate transformation between the source and the Sort to SUM the count by name and year or if it's semi-additive, then find the last quarter for name and year.

A non-sorting method would involve two data flows and the lookup component. A lookup can make, at most, one match. In the supplied screenshots, I would assume that the Pull Requests is the "driver" table as C++ will be in there from "year of inception" to current. Whereas your repos dataset would only have C++ in there once. So, we will augment the "current" row from pull requests with a few fields from the repos data set.

The only hiccup here is that the Lookup supports data from an OLE DB provider or the SSIS Cache Connection Manager. Again, assuming your data is Flat File/Excel, we'll need to use a Cache Connection Manager.

Data Flow 1 will "prime the pump" with our Repo data. Your Data Flow will be Source -> Cache Connection Transformation

Define the Key in your cache connection manager as the Language

Cache Connection Manager

In Data Flow 2, you'll use a Lookup Transformation with a source of Cache Connection Manager and then map Name to Language and click the check box next to Language and num_repos to bring them into your pipeline.

Finally, write the results to wherever you need them.

Upvotes: 1

Related Questions