Language identification Using pycld2

Question

I'm trying to identify all the possible languages in each row of a dataframe. Here is a sample of my dataframe:

import pandas as pd
import pycld2 as cld2

dataload = [['AB1',"Machine learning isn't difficult"],['AB2','O aprendiz ado de máquina não é tão difíci كما يظن الناس']]
dfTest=pd.DataFrame(dataload, columns=['UID','TXT'])

UID	TXT
AB1	Machine learning isn't difficult
AB2	O aprendiz ado de máquina não é tão difíci كما يظن الناس

Using detect from pycld2, I am able to identify all the possible languages:

dfTest['language']=[cld2.detect(x)[2] for x in dfTest['TXT']]

Output is

UID	TXT	language
AB1	Machine learning isn't difficult	(('ENGLISH', 'en', 97, 1055.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))
AB2	O aprendiz ado de máquina não é tão difíci كما يظن الناس	(('PORTUGUESE', 'pt', 64, 832.0), ('ARABIC', 'ar', 33, 819.0), ('Unknown', 'un', 0, 0.0))

However, the output I require is

UID	TXT	language
AB1	Machine learning isn't difficult	('ENGLISH', 'Unknown', 'Unknown')
AB2	O aprendiz ado de máquina não é tão difíci كما يظن الناس	('PORTUGUESE', 'ARABIC', 'Unknown')

or

UID	TXT	language
AB1	Machine learning isn't difficult	ENGLISH, Unknown, Unknown
AB2	O aprendiz ado de máquina não é tão difíci كما يظن الناس	PORTUGUESE, ARABIC, Unknown

I have looked through the documentation and Stack Overflow but could not find a relevant answer. Please guide.

Corralien · Accepted Answer

>>> dfTest['TXT'].apply(lambda x: [r[0] for r in cld2.detect(x)[2]])
0      [ENGLISH, Unknown, Unknown]
1    [PORTUGUESE, ARABIC, Unknown]
Name: TXT, dtype: object
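
The one-liner above keeps only the first element (the language name) of each entry in the details tuple that the question reads from cld2.detect(x)[2]. As a minimal sketch of the same idea, the snippet below works on the details tuples copied from the question's output, so it runs without pycld2 or pandas installed. The helper names language_names and language_names_joined are hypothetical, introduced here only for illustration; they cover the two output shapes the question asks for (a tuple of names, or one comma-separated string).

```python
# Detection details copied from the question's output. Each entry is
# (language name, language code, percent of text, score).
details_ab1 = (('ENGLISH', 'en', 97, 1055.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))
details_ab2 = (('PORTUGUESE', 'pt', 64, 832.0), ('ARABIC', 'ar', 33, 819.0), ('Unknown', 'un', 0, 0.0))


def language_names(details):
    """Hypothetical helper: keep only the language name of each entry."""
    return tuple(entry[0] for entry in details)


def language_names_joined(details, sep=', '):
    """Hypothetical helper: the same names as one delimited string."""
    return sep.join(language_names(details))


print(language_names(details_ab1))         # tuple form
print(language_names_joined(details_ab2))  # string form

# Assuming pycld2 and pandas are installed and dfTest is built as in the
# question, the helpers could be applied per row like this:
# dfTest['language'] = dfTest['TXT'].apply(lambda x: language_names(cld2.detect(x)[2]))
```

To store the result in the dataframe rather than just display it, assign the returned Series to a column, as in the commented line at the end of the sketch.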
