Ravindra S

Reputation: 6432

Why is Pandas returning two rows when I am accessing only one?

I have a large dataset of tweets in csv format. There are 5 columns and '|' is the delimiter. There are no headers in the dataset.

Format of the rows : TweetID|Language|IsRetweet|TwitterHandle|TweetText

To give you an idea, here are the first two rows of the dataset:

2583800911711234567|und|false|@handle1|#photogrid #love #cute #adorable #appbreeze #kiss #kisses #hugs #together #photooftheday #happy #me…
579521672137654321|en|True|@handle2|RT @handle3: Today BE HAPPY!! Folks are usually about as #happy as they make their #minds up to be. #In

(TweetId and TwitterHandle changed for privacy reasons)

I am trying to find the non-English tweets, of which there are only a few in the dataset.

Following is the code:

import pandas as pd

def extract_tweets(filename):

    eng_df = pd.read_csv(filename, sep='|')

    print("-"*40)
    print("First tweet")
    print(str(eng_df.iloc[0])) # line a

    lang_list = list(eng_df.iloc[:,1])

    print("-"*40)
    print("Langauage of first three tweets: " + str(lang_list[:3])) #line b
    print("Total tweets in this file: " +  str(len(lang_list)))
    print("English tweet count: " + str(lang_list.count("en"))) #line c

    count = 0
    with open(filename, 'r') as f:
        for line in f:
            if line.find("|und|") != -1:
                count += 1
    print("-"*40)
    print("Count of \"|und|\" in the file: " + str(count))

extract_tweets("english_tweets")

And here is the output:

sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
----------------------------------------
First tweet
2583800911711234567                                                   579521672137654321
und                                                                                   en
false                                                                               True
@handle1                                                             @handle2
#photogrid #love #cute #adorable #appbreeze #kiss    RT @handle3: Today BE HAPPY!!
Name: 0, dtype: object
----------------------------------------
Langauage of first three tweets: ['en', 'en', 'en']
Total tweets in this file: 1225236
English tweet count: 1225236
----------------------------------------
Count of "|und|" in the file: 1

Questions:

1) When I access only the first row (line a), why does it print both the first and second rows? I have tried accessing different rows, and each time it prints two rows.

2) When I fetch the entire second column (the language), why does it not include the first tweet and its language value 'und' (lines b and c)? The tweet counts I print make this evident. Yet when I read the file the normal way, it does show one tweet with the language value 'und'. Am I missing something? Is pandas perhaps treating the first row differently by default, as column headers?

Upvotes: 0

Views: 48

Answers (1)

Roy Jevnisek

Reputation: 320

You might want to specify explicitly that the header is missing:

 pd.read_csv(filename, sep="|", header=None)
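
To see why this matters, here is a minimal sketch using two made-up rows in the same layout as the question (the sample values and the variable names are placeholders, not the real data). By default `read_csv` infers the header from the first line, so that line becomes the column labels instead of a data row:

```python
import io
import pandas as pd

# Two placeholder rows in the question's layout: ID|Language|IsRetweet|Handle|Text
sample = "111|und|false|@a|first tweet\n222|en|True|@b|second tweet\n"

# Default behaviour: the first line is consumed as the column labels.
with_default = pd.read_csv(io.StringIO(sample), sep="|")

# header=None: every line is data, and the columns are labelled 0..4.
with_no_header = pd.read_csv(io.StringIO(sample), sep="|", header=None)

print(len(with_default))                # 1
print(list(with_default.columns))       # ['111', 'und', 'false', '@a', 'first tweet']
print(len(with_no_header))              # 2
print(list(with_no_header.iloc[:, 1]))  # ['und', 'en']
```

This would also explain the "two rows" in the question's output: `eng_df.iloc[0]` is a Series whose index is the column labels (here, the values of the first line) and whose values are the first data row (the second line), so printing it shows both side by side.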

Upvotes: 2
