Reputation: 13
I'm trying to create a column that identifies English and Spanish tweets from a dataframe with several rows of tweets. Ideally tweets in English would be categorized as a 1 and those in Spanish would be marked as 0.
The end goal is to filter the Spanish tweets out of my dataframe and save the English tweets to a new CSV. I looked at TextBlob, langdetect, and fastText, but everything I've found only shows how to run the code on one string of text at a time.
Is there an easy way to categorize by language (English/Spanish) for a whole dataframe using Python?
Upvotes: 1
Views: 384
Reputation: 69
I think there are two ways.
i) Train an NLP model to determine whether the text is Spanish or English.
ii) Just put the sentences in a list and use an if condition that checks for Spanish characters to assign 1 or 0.
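A minimal sketch of option ii), assuming that "Spanish characters" means accented vowels, ñ, ü, and the inverted punctuation marks. The name english_flag and the sample tweets are made up for illustration. The heuristic is weak: Spanish text written without accents is still labelled as English, as the last sample shows.

```python
# Characters that commonly appear in Spanish but not in plain English text.
# This set is an assumption for illustration, not an exhaustive rule.
SPANISH_CHARS = set("ñáéíóúü¿¡")


def english_flag(text):
    """Return 0 if text contains a Spanish-specific character, else 1."""
    return 0 if any(ch in SPANISH_CHARS for ch in text.lower()) else 1


tweets = [
    "hello, this is a tweet",
    "¿dónde está la biblioteca?",
    "hola esto es un tweet",  # Spanish, but no accented characters
]
flags = [english_flag(t) for t in tweets]
print(flags)  # [1, 0, 1] -> the last tweet is misclassified as English
```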
Hope it is helpful.
Upvotes: 0
Reputation: 86
The apply method can be used in this case. Using langdetect as an example:
>>> import pandas as pd
>>> from langdetect import detect
>>> df = pd.DataFrame({"text": ["hello, this is a tweet", "hola esto es un tweet"]})
>>> df["language"] = df["text"].apply(detect)
>>> df
                     text  language
0  hello, this is a tweet        en
1   hola esto es un tweet        es
Or, in your particular case, if you want 0/1 indicators, you could do:
>>> df["language"] = df["text"].apply(lambda x: int(detect(x) == "en"))
>>> df
                     text  language
0  hello, this is a tweet         1
1   hola esto es un tweet         0
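To get from the indicator column to the filtered CSV the question asks for, something like the following sketch might work. The helper names add_english_flag and keep_english, the column name is_english, and the output file name are all hypothetical. The detector is passed in as an argument so any function that maps a string to a language code (for example langdetect's detect) can be used. The try/except is there because langdetect raises an exception on empty or symbol-only text.

```python
import pandas as pd


def add_english_flag(df, detect_fn, text_col="text", flag_col="is_english"):
    """Return a copy of df with a 1/0 column: 1 if detect_fn returns 'en', else 0."""

    def flag(text):
        # Missing values (NaN) and other non-strings cannot be classified.
        if not isinstance(text, str):
            return 0
        try:
            return int(detect_fn(text) == "en")
        except Exception:
            # Detection failed (e.g. empty or symbol-only text): treat as not English.
            return 0

    out = df.copy()
    out[flag_col] = out[text_col].apply(flag)
    return out


def keep_english(df, flag_col="is_english"):
    """Keep only the rows flagged as English and drop the helper column."""
    return df[df[flag_col] == 1].drop(columns=flag_col)
```

With langdetect installed, this could be used as english = keep_english(add_english_flag(df, detect)) followed by english.to_csv("english_tweets.csv", index=False). Note that langdetect can give different answers across runs on very short texts; its README suggests setting DetectorFactory.seed = 0 (from langdetect import DetectorFactory) for consistent results.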
Upvotes: 1