themrdan
themrdan

Reputation: 117

Creating data frame conditionally based on 3 data frames

I have the following 3 data frames:

dfSpa = pd.read_csv(
    "sentences and translations/SpanishSentences.csv", sep=',')
print(dfSpa.head())

dfEng = pd.read_csv(
    'sentences and translations/EngTranslations.csv', sep=',')
print(dfEng.head())

dfIndex = pd.read_csv(
    'sentences and translations/SpaSentencesThatHaveEngTranslations.csv', sep=',')
print(dfIndex.head())

That output the following:

      0    1                               2
0  2482  spa        Tengo que irme a dormir.
1  2487  spa   Ahora, Muiriel tiene 20 años.
2  2493  spa  Simplemente no sé qué decir...
3  2495  spa      Yo estaba en las montañas.
4  2497  spa          No sé si tengo tiempo.
      0    1                               2
0  1277  eng          I have to go to sleep.
1  1282  eng              Muiriel is 20 now.
2  1287  eng     This is never going to end.
3  1288  eng  I just don't know what to say.
4  1290  eng         I was in the mountains.
      0       1
0  2482    1277
1  2487    1282
2  2493    1288
3  2493  693485
4  2495    1290

Colum 0 in dfIndex represents a Spanish sentence in dfSpa and column 1 represents the English translation in dfEng that goes with it. dfSpa has more rows than the other 2 df's so, some sentences do not have english translations. Also, dfIndex is longer than dfEng because there are some duplicate translations with different values such as with 2493, in dfIndex.head(), as shown above.

I am trying to create another data frame that simply has the Spanish sentence in one column and the corresponding English translation in the other column. How could I get this done?

Upvotes: 0

Views: 44

Answers (2)

mujjiga
mujjiga

Reputation: 16876

dfIndex.merge(
    dfSpa[[0,2]], on=0)[[1,2]].rename(columns={2: "Spa"}).merge(
        dfEng, left_on=1, right_on=0).rename(columns={2: "Eng"})[['Spa', 'Eng']]

Upvotes: 1

ChairNTable
ChairNTable

Reputation: 86

You could try:

df_n=pd.DataFrame()
df_n['A'] = [df.iloc[x].values for x in dfSpa.loc[:,0]]
df_n['B'] = [df.iloc[x].values for x in dfEng.loc[:,0]]

and then remove duplicated rows using:

df_n = df_n.drop_duplicates(subset = ['A'])

It would be easier to check if you had sample dfs.

Upvotes: 1

Related Questions