Reputation: 91
I'm trying to identify all the possible languages in each row of a dataframe. Here is a sample of my dataframe:
import pandas as pd
import pycld2 as cld2
dataload = [['AB1', "Machine learning isn't difficult"], ['AB2', 'O aprendiz ado de máquina não é tão difíci كما يظن الناس']]
dfTest = pd.DataFrame(dataload, columns=['UID', 'TXT'])
UID | TXT |
---|---|
AB1 | Machine learning isn't difficult |
AB2 | O aprendiz ado de máquina não é tão difíci كما يظن الناس |
Using detect from pycld2, I am able to identify all the possible languages:
dfTest['language']=[cld2.detect(x)[2] for x in dfTest['TXT']]
The output is:
UID | TXT | language |
---|---|---|
AB1 | Machine learning isn't difficult | (('ENGLISH', 'en', 97, 1055.0),('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)) |
AB2 | O aprendiz ado de máquina não é tão difíci كما يظن الناس | (('PORTUGUESE', 'pt', 64, 832.0), ('ARABIC', 'ar', 33, 819.0), ('Unknown', 'un', 0, 0.0)) |
However, the output I require is:
UID | TXT | language |
---|---|---|
AB1 | Machine learning isn't difficult | ('ENGLISH', 'Unknown', 'Unknown') |
AB2 | O aprendiz ado de máquina não é tão difíci كما يظن الناس | ('PORTUGUESE', 'ARABIC', 'Unknown') |
or
UID | TXT | language |
---|---|---|
AB1 | Machine learning isn't difficult | ENGLISH, Unknown, Unknown |
AB2 | O aprendiz ado de máquina não é tão difíci كما يظن الناس | PORTUGUESE, ARABIC, Unknown |
I have looked through the documentation and Stack Overflow but could not find a relevant answer. Please guide.
Upvotes: 2
Views: 2028
Reputation: 120429
>>> dfTest['TXT'].apply(lambda x: [r[0] for r in cld2.detect(x)[2]])
0 [ENGLISH, Unknown, Unknown]
1 [PORTUGUESE, ARABIC, Unknown]
Name: TXT, dtype: object
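To get either of the two shapes asked for in the question (a tuple of names or a comma-separated string), the same idea can be wrapped in small helpers. This is only a sketch: the helper names `language_names` and `language_names_joined` are made up for illustration, and the detail tuples are copied from the output shown in the question rather than produced by calling pycld2, so the snippet runs without the library installed.

```python
# Detail tuples copied from the question's output. In real use they would
# come from cld2.detect(text)[2].
details_ab1 = (('ENGLISH', 'en', 97, 1055.0),
               ('Unknown', 'un', 0, 0.0),
               ('Unknown', 'un', 0, 0.0))
details_ab2 = (('PORTUGUESE', 'pt', 64, 832.0),
               ('ARABIC', 'ar', 33, 819.0),
               ('Unknown', 'un', 0, 0.0))


def language_names(details):
    """Hypothetical helper: keep only the first field (the name) of each detail tuple."""
    return tuple(detail[0] for detail in details)


def language_names_joined(details, sep=", "):
    """Hypothetical helper: the same names joined into a single string."""
    return sep.join(language_names(details))


print(language_names(details_ab1))
print(language_names_joined(details_ab2))
```

With pandas and pycld2 available, the helper should slot into the same `apply` pattern as above, for example `dfTest['language'] = dfTest['TXT'].apply(lambda x: language_names(cld2.detect(x)[2]))`.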
Upvotes: 2