Santosh

Reputation: 91

Language identification using pycld2

I'm trying to identify all the possible languages in each row of a dataframe. Here is a sample of my dataframe:

import pandas as pd
import pycld2 as cld2

dataload = [['AB1',"Machine learning isn't difficult"],['AB2','O aprendiz ado de máquina não é tão difíci كما يظن الناس']]
dfTest=pd.DataFrame(dataload, columns=['UID','TXT'])
UID TXT
AB1 Machine learning isn't difficult
AB2 O aprendiz ado de máquina não é tão difíci كما يظن الناس

Using detect from pycld2, I am able to identify all the possible languages:

dfTest['language']=[cld2.detect(x)[2] for x in dfTest['TXT']]

Output is

UID TXT language
AB1 Machine learning isn't difficult (('ENGLISH', 'en', 97, 1055.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))
AB2 O aprendiz ado de máquina não é tão difíci كما يظن الناس (('PORTUGUESE', 'pt', 64, 832.0), ('ARABIC', 'ar', 33, 819.0), ('Unknown', 'un', 0, 0.0))

However, the output I need contains only the language names:

UID TXT language
AB1 Machine learning isn't difficult ('ENGLISH', 'Unknown', 'Unknown')
AB2 O aprendiz ado de máquina não é tão difíci كما يظن الناس ('PORTUGUESE', 'ARABIC', 'Unknown')

or

UID TXT language
AB1 Machine learning isn't difficult ENGLISH, Unknown, Unknown
AB2 O aprendiz ado de máquina não é tão difíci كما يظن الناس PORTUGUESE, ARABIC, Unknown

I have looked through the documentation and Stack Overflow but could not find a relevant answer. Please guide.

Upvotes: 2

Views: 2028

Answers (1)

Corralien

Reputation: 120429

>>> dfTest['TXT'].apply(lambda x: [r[0] for r in cld2.detect(x)[2]])
0      [ENGLISH, Unknown, Unknown]
1    [PORTUGUESE, ARABIC, Unknown]
Name: TXT, dtype: object
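
The one-liner above keeps only the first field (the language name) of each tuple in the details returned by cld2.detect(x)[2]. To get the two formats asked for in the question (a tuple of names, or a comma-separated string), the same extraction could be wrapped in small helpers. This is only a sketch: language_names and language_names_joined are hypothetical names, not part of pycld2, and the sample tuples are copied from the question's output so the snippet runs without pycld2 installed.

```python
# Sample `details` tuples copied from the question's output,
# i.e. what cld2.detect(x)[2] returned for rows AB1 and AB2.
details_ab1 = (
    ('ENGLISH', 'en', 97, 1055.0),
    ('Unknown', 'un', 0, 0.0),
    ('Unknown', 'un', 0, 0.0),
)
details_ab2 = (
    ('PORTUGUESE', 'pt', 64, 832.0),
    ('ARABIC', 'ar', 33, 819.0),
    ('Unknown', 'un', 0, 0.0),
)


def language_names(details):
    """Return a tuple with only the language name of each detection entry."""
    # Each entry looks like (name, code, percent, score); keep the name.
    return tuple(entry[0] for entry in details)


def language_names_joined(details, sep=", "):
    """Return the language names as a single separator-joined string."""
    return sep.join(language_names(details))


print(language_names(details_ab1))         # ('ENGLISH', 'Unknown', 'Unknown')
print(language_names_joined(details_ab2))  # PORTUGUESE, ARABIC, Unknown
```

With pycld2 available, the result could then be stored as a column, for example dfTest['language'] = dfTest['TXT'].apply(lambda x: language_names_joined(cld2.detect(x)[2])).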

Upvotes: 2
