Concatenate two pandas dataframes for analysis

Question

I'm trying to solve a problem about customer preferences on restaurants. I have two different CSVs, one which has the customer information:

And the other one has restaurant ratings:

I want to try supervised learning to predict the restaurant rating from customer preferences. To do that, I think I need to attach the customer information to each rating, so that I have variables to analyze.

I'm trying this using Python and pandas.

I have tried this:

import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from pandas.plotting import scatter_matrix
import numpy as np


df1 = pd.read_csv('/2_user_profile.csv', index_col = [0])
df2 = pd.read_csv('/3_Ratings.csv')

#Create empty dataframe with named columns
df = pd.DataFrame(columns=(np.concatenate((df2.columns.values, df1.columns.values), axis=0)))

#Joining the tables
for index, row in df2.iterrows():
    userID= row['userID']
    frame=[row, df1.loc[userID]]
    print(frame)
    df = pd.concat([df, pd.DataFrame(frame)], axis=0)

print(df)

The print(frame) will give me this result:

And that makes sense but when I print df it gives me this:

So each frame I create adds two records to the dataframe: one with the information from df1, where every column that comes from df2 is empty, and another with the values from df2, where every column that comes from df1 is empty.

This is my first go at Python and machine learning, so let me know if you also have comments on my approach.

smj · Accepted Answer

Looks like you want to join on the userID in both dataframes, right?

You can do this using merge. Here is a short example:

import pandas as pd

data_1 = pd.DataFrame({'id': ['A', 'B'] * 5, 'value_1': [0, 1] * 5})
data_2 = pd.DataFrame({'id': ['A', 'B'], 'value_2': [3, 4]})

data_1.merge(data_2, how = 'inner', left_on = 'id', right_on = 'id')

This gives one row for each row of data_1, with the matching value_2 from data_2 added as a new column.
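
Applied to the layout in the question, it might look like the sketch below. It assumes that userID is the index of the profile table (the question reads it with index_col = [0] and looks rows up with df1.loc[userID]) and a regular column of the ratings table. The sample values and every column name other than userID are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the user profile CSV: userID is the index.
profiles = pd.DataFrame(
    {"smoker": [False, True], "budget": ["low", "medium"]},
    index=pd.Index(["U1001", "U1002"], name="userID"),
)

# Hypothetical stand-in for the ratings CSV: userID is a regular column.
ratings = pd.DataFrame(
    {
        "userID": ["U1001", "U1001", "U1002"],
        "placeID": [10, 11, 10],
        "rating": [2, 1, 0],
    }
)

# Left join: keep every rating and attach the profile columns of the
# matching user, so each rating becomes a single row with all variables.
merged = ratings.merge(profiles, how="left", left_on="userID", right_index=True)
print(merged)
```

A left join keeps ratings whose user has no profile (the profile columns are then NaN); use how = 'inner' to drop them instead. ratings.join(profiles, on='userID') should be an equivalent shorthand here. This replaces the iterrows loop entirely.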
