Reputation: 6432
I have a large dataset of tweets in CSV format. There are 5 columns, and '|' is the delimiter. The dataset has no header row.
Format of the rows: TweetID|Language|IsRetweet|TwitterHandle|TweetText
To give you an idea, here are the first two rows of the dataset:
2583800911711234567|und|false|@handle1|#photogrid #love #cute #adorable #appbreeze #kiss #kisses #hugs #together #photooftheday #happy #me…
579521672137654321|en|True|@handle2|RT @handle3: Today BE HAPPY!! Folks are usually about as #happy as they make their #minds up to be. #In
(TweetId and TwitterHandle changed for privacy reasons)
I am trying to find the non-English tweets, of which there are only a few in the dataset.
Here is the code:
import pandas as pd

def extract_tweets(filename):
    eng_df = pd.read_csv(filename, sep='|')
    print("-"*40)
    print("First tweet")
    print(str(eng_df.iloc[0])) # line a
    lang_list = list(eng_df.iloc[:,1])
    print("-"*40)
    print("Langauage of first three tweets: " + str(lang_list[:3])) #line b
    print("Total tweets in this file: " + str(len(lang_list)))
    print("English tweet count: " + str(lang_list.count("en"))) #line c
    count = 0
    f = open(filename,'r')
    for line in f:
        if line.find("|und|") != -1:
            count += 1
    print("-"*40)
    print("Count of \"|und|\" in the file: " + str(count))

extract_tweets("english_tweets")
And here is the output:
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
----------------------------------------
First tweet
2583800911711234567 579521672137654321
und en
false True
@handle1 @handle2
#photogrid #love #cute #adorable #appbreeze #kiss RT @handle3: Today BE HAPPY!!
Name: 0, dtype: object
----------------------------------------
Langauage of first three tweets: ['en', 'en', 'en']
Total tweets in this file: 1225236
English tweet count: 1225236
----------------------------------------
Count of "|und|" in the file: 1
Questions:
1) When I try to access just the first row (line a), why does it print both the first and the second row? I have tried accessing different rows, and each time it prints two rows.
2) When I fetch the entire second column (the language), why does it not capture the first tweet and its language value 'und' (lines b and c)? This is evident from the tweet counts I am printing. Funnily enough, when I open the file in the normal way, it does show that there is one tweet with the language value 'und'. Am I missing something? Is it perhaps treating the first row differently, as column headers, by default?
Upvotes: 0
Views: 48
Reputation: 320
By default, read_csv treats the first row of the file as the column headers, so you might want to specify explicitly that the header is missing:
pd.read_csv(filename, sep="|", header=None)
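To see the difference, here is a minimal sketch using two made-up rows in the same five-column layout as the question (the IDs, handles, and tweet text are placeholders, and the column names passed to names= are just illustrative labels, not something the file contains):

```python
import io

import pandas as pd

# Two made-up rows in the same '|'-delimited, header-less layout as the question.
sample = (
    "1111|und|false|@handle1|first tweet text\n"
    "2222|en|True|@handle2|second tweet text\n"
)

# Default behaviour: the first row is consumed as the column names,
# so only one data row remains and 'und' ends up in the header.
df_default = pd.read_csv(io.StringIO(sample), sep="|")

# With header=None every row is treated as data.
# The names= labels are illustrative assumptions.
columns = ["TweetID", "Language", "IsRetweet", "TwitterHandle", "TweetText"]
df = pd.read_csv(io.StringIO(sample), sep="|", header=None, names=columns)

print(len(df_default), list(df_default.columns)[1])
print(len(df), df["Language"].tolist())
```

In the default read, printing iloc[0] shows the first row's values as the index labels next to the second row's values, which is why it looks like two rows are printed. If you omit names=, pandas labels the columns 0 to 4 and iloc[:, 1] still selects the language column.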
Upvotes: 2