nokvk
nokvk

Reputation: 353

Pandas: combine different CSVs in one df combining by the name

I have some different csv files in a same folder with this structure:

CSV 1:

    Name      Passes     Shots
1   Player 1     20        5
2   Player 2     30        6
3   Player 3     10        3

CSV 2:

    Name      Goals     Duels
1   Player 3     2        3
2   Player 1     0        2
3   Player 2     1        7

CSV 3:

    Name      Country     Age
1   Player 2     SPA        25
2   Player 3     SPA        26
3   Player 1     USA        23

I would like to combine this csvs in one dataframe with pandas and the result I want would be:

     Name    Passes     Shots    Goals    Duels    Country    Age
1   Player 1     20        5       0        2        USA       23
2   Player 2     30        6       1        7        SPA       25
3   Player 3     10        3       2        3        SPA       26

I'm trying to combine them with this code but i get a dataframe with 9 rows (Player 1 3 times, Player 2 3 times and Player 3 3 times):

file_extension = ".csv"

all_filenames = [i for i in glob.glob(f"*{file_extension}")]

df = pd.concat([pd.read_csv(file) for file in all_filenames])

The result I get is something like:

     Name    Passes     Shots    Goals    Duels    Country    Age
1   Player 1     20        5      NaN      NaN       NaN      NaN
2   Player 2     30        6      NaN      NaN       NaN      NaN
3   Player 3     10        3      NaN      NaN       NaN      NaN
1   Player 1    NaN      NaN       0        2        NaN      NaN
2   Player 2    NaN      NaN       1        7        NaN      NaN
3   Player 3    NaN      NaN       2        3        NaN      NaN
1   Player 1    NaN      NaN       NaN      NaN      USA       23
2   Player 2    NaN      NaN       NaN      NaN      SPA       25
3   Player 3    NaN      NaN       NaN      NaN      SPA       26

Any idea of how to combine them as I want, with the name as the reference? Thanks in advance!

Upvotes: 1

Views: 100

Answers (2)

epritchard
epritchard

Reputation: 61

A way around drop_duplicates is just by merging on multiple column headers. That way if there is a player that has played on two different teams the player_name will be the same but the team will be different and the merge will recognize that.

You can also use pandas to read in the CSV into a data frame (since you have a lot to read, this can be in a large for-loop).

data1 = pd.read_csv(PATH_TO_CSV = '/your_csv.csv')
data2 = pd.read_csv(PATH_TO_CSV = '/your_second_csv.csv')

merged_df = pd.merge(data1, data2, how='outer', on=['player', 'team']

Something else to note with data frame manipulation is that you can concatenate them together on an axis in order to bring them together. It can look something like this...

df_list = []
for loop {
   data = pd.read_csv(PATH_TO_CSV = '/your_csv.csv')
   # check type of data to make sure its a dataframe
   df_list.append(data)
   # and on to the next csv
}
# after the for loop you can concatenate the data frames together
concat_df = pd.concat(df_list, axis=0, sort=False)
# need to make sure the first df_list is iterable or else you will get an error
# then you can create a new csv with all of your data
concat_df.to_csv(OUTPUT_PATH + '/your_new_csv.csv', index=False)

Upvotes: 1

mujjiga
mujjiga

Reputation: 16856

Use pandas merge and specify the left and right keys

df1 = pd.DataFrame({
    'Name': ['Player 1', 'Player 2', 'Player 3'],
    'Passes': [20, 30, 10],
    'Shots': [5, 6, 3]
})
df2 = pd.DataFrame({
    'Name': ['Player 3', 'Player 1', 'Player 2'],
    'Goals': [2, 0, 1],
    'Duels': [3, 2, 7]
})
df3 = pd.DataFrame({
    'Name': ['Player 2', 'Player 3', 'Player 1'],
    'Country': ['SPA', 'SPA', 'USA'],
    'Age': [25, 26, 23]
})

df1.merge(df2, left_on="Name", right_on="Name").merge(
    df3, left_on="Name", right_on="Name")

Output:

        Name    Passes  Shots   Goals   Duels   Country Age
0   Player 1        20     5    0          2    USA     23
1   Player 2        30     6    1          7    SPA     25
2   Player 3        10     3    2          3    SPA     26

Edit 1:

If you have many such file and all of them have Name as key then you can use:

df = None
for f in list_of_files:
  df = pd.read_csv(f) if df is None else df.merge(
      pd.read_csv(f), left_on="Name", right_on="Name")

Upvotes: 1

Related Questions