rpa1111

Reputation: 13

How to identify Spanish vs. English Text from a csv of tweets?

I'm trying to create a column that identifies English and Spanish tweets from a dataframe with several rows of tweets. Ideally tweets in English would be categorized as a 1 and those in Spanish would be marked as 0.

The end goal is to be able to filter out Spanish tweets from my dataframe to save English tweets in a new CSV. I looked at using Textblob, langdetect, and also fastText, but everything I've found gives instructions for just running the code on 1 string of text at a time.

Is there an easy way to categorize by language (English/Spanish) for a whole dataframe using Python?

Upvotes: 1

Views: 384

Answers (2)

GHOST5454

Reputation: 69

  • I think there are two ways:

    i) Train an NLP model to classify each text as Spanish or English.

    ii) Put the sentences in a list and use an if condition that checks for Spanish-specific characters, assigning 1 or 0 accordingly.

    Hope this helps.
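Option ii) could look roughly like the sketch below. The character set, the word list, the two-word threshold, and the name looks_spanish are all illustrative assumptions, not a tested rule; a heuristic like this will mislabel English tweets that contain loanwords such as "jalapeño", so treat it as a quick baseline only.

```python
import re

# Assumption: characters that are common in Spanish and rare in English text.
SPANISH_CHARS = set("áéíóúüñ¿¡")

# Assumption: a tiny, hand-picked list of frequent Spanish words.
SPANISH_WORDS = {"hola", "esto", "es", "un", "una", "el", "la", "los",
                 "las", "que", "para", "por", "pero", "muy"}


def looks_spanish(text):
    """Return True if the text contains Spanish-specific characters
    or at least two words from the small Spanish word list."""
    lowered = text.lower()
    if any(ch in SPANISH_CHARS for ch in lowered):
        return True
    words = re.findall(r"[a-záéíóúüñ]+", lowered)
    hits = sum(word in SPANISH_WORDS for word in words)
    return hits >= 2


tweets = ["hello this is a tweet", "hola esto es un tweet", "¿Dónde estás?"]

# 1 = English, 0 = Spanish, matching the labels asked for in the question.
labels = [0 if looks_spanish(t) else 1 for t in tweets]
```

The same function can be passed to a dataframe column with apply if the tweets are already loaded in pandas.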

Upvotes: 0

JC Royal Fish

Reputation: 86

The apply method can be used in this case. Using langdetect as an example:

>>> import pandas as pd
>>> from langdetect import detect
>>> df = pd.DataFrame({"text": ["hello this is a tweet", "hola esto es un tweet"]})
>>> df["language"] = df["text"].apply(detect)
>>> df
                    text language
0  hello this is a tweet       en
1  hola esto es un tweet       es

Or, in your particular case, if you want 0/1 indicators, you could do:

>>> df["language"] = (df["text"].apply(detect) == "en").astype(int)
>>> df
                    text  language
0  hello this is a tweet         1
1  hola esto es un tweet         0
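To get from the 0/1 column to the new CSV the question asks for, something like the following sketch might work. The function names, the "is_english" column, and the file paths are hypothetical; the "text" column name is taken from the example above. langdetect raises an exception on empty or symbol-only text, so the helper treats any failed detection as non-English, which is an assumption you may want to change. Results on very short tweets can vary between runs unless the detector is seeded.

```python
import pandas as pd


def add_english_flag(df, detect_fn, text_col="text"):
    """Return a copy of df with an 'is_english' column (1 = English, 0 = other).

    detect_fn is any callable that maps a string to a language code,
    for example langdetect.detect.
    """
    def flag(text):
        try:
            return int(detect_fn(str(text)) == "en")
        except Exception:
            # Assumption: text the detector cannot handle (e.g. empty or
            # emoji-only tweets) is counted as non-English.
            return 0

    out = df.copy()
    out["is_english"] = out[text_col].apply(flag)
    return out


def filter_english_csv(in_path, out_path, text_col="text"):
    """Read tweets from in_path and write only the English rows to out_path."""
    # Imported here so add_english_flag can be reused with another detector.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # make langdetect's results repeatable
    df = pd.read_csv(in_path)
    flagged = add_english_flag(df, detect, text_col)
    flagged[flagged["is_english"] == 1].to_csv(out_path, index=False)
```

With hypothetical file names, the call would be filter_english_csv("tweets.csv", "english_tweets.csv").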

Upvotes: 1
