Reputation: 117
I have the following 3 data frames:
dfSpa = pd.read_csv(
"sentences and translations/SpanishSentences.csv", sep=',')
print(dfSpa.head())
dfEng = pd.read_csv(
'sentences and translations/EngTranslations.csv', sep=',')
print(dfEng.head())
dfIndex = pd.read_csv(
'sentences and translations/SpaSentencesThatHaveEngTranslations.csv', sep=',')
print(dfIndex.head())
That output the following:
0 1 2
0 2482 spa Tengo que irme a dormir.
1 2487 spa Ahora, Muiriel tiene 20 años.
2 2493 spa Simplemente no sé qué decir...
3 2495 spa Yo estaba en las montañas.
4 2497 spa No sé si tengo tiempo.
0 1 2
0 1277 eng I have to go to sleep.
1 1282 eng Muiriel is 20 now.
2 1287 eng This is never going to end.
3 1288 eng I just don't know what to say.
4 1290 eng I was in the mountains.
0 1
0 2482 1277
1 2487 1282
2 2493 1288
3 2493 693485
4 2495 1290
Colum 0 in dfIndex represents a Spanish sentence in dfSpa and column 1 represents the English translation in dfEng that goes with it. dfSpa has more rows than the other 2 df's so, some sentences do not have english translations. Also, dfIndex is longer than dfEng because there are some duplicate translations with different values such as with 2493, in dfIndex.head(), as shown above.
I am trying to create another data frame that simply has the Spanish sentence in one column and the corresponding English translation in the other column. How could I get this done?
Upvotes: 0
Views: 44
Reputation: 16876
dfIndex.merge(
dfSpa[[0,2]], on=0)[[1,2]].rename(columns={2: "Spa"}).merge(
dfEng, left_on=1, right_on=0).rename(columns={2: "Eng"})[['Spa', 'Eng']]
Upvotes: 1
Reputation: 86
You could try:
df_n=pd.DataFrame()
df_n['A'] = [df.iloc[x].values for x in dfSpa.loc[:,0]]
df_n['B'] = [df.iloc[x].values for x in dfEng.loc[:,0]]
and then remove duplicated rows using:
df_n = df_n.drop_duplicates(subset = ['A'])
It would be easier to check if you had sample dfs.
Upvotes: 1