Encipher

Reputation: 2946

Loading data from a file with no file extension into a pandas DataFrame

I would like to use the SMS Spam Collection Data Set, which can be found on the UCI Machine Learning Repository, to build a classification model. The data file shared on the repository has no file extension. The data looks like the following:

    ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
    ham Ok lar... Joking wif u oni...
    spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    ham U dun say so early hor... U c already then say...
    ham Nah I don't think he goes to usf, he lives around here though
    spam    FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Here, ham or spam should be the class attribute, and the rest of each line is the message. How can I load the dataset into a pandas DataFrame? The DataFrame should look like the following:

Message Class   Messages
ham             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham             Ok lar... Joking wif u oni...
spam            Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham             U dun say so early hor... U c already then say...
ham             Nah I don't think he goes to usf, he lives around here though
spam            FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Upvotes: 0

Views: 61

Answers (2)

Selknam_CL

Reputation: 34

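This should work:

import pandas as pd

# header=None keeps the first message as data instead of using it as column names
df = pd.read_csv("your_file.csv", sep="\t", header=None)
df.dropna(how="all", inplace=True, axis=1)  # drop columns that are entirely empty
df.columns = ['label', 'message']
df.head()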

Upvotes: 0

Timeless

Reputation: 37877

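The file looks like a tab-separated text file, so you can use pandas.read_csv. The missing extension does not matter, because read_csv parses the file's contents rather than its name: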

import pandas as pd

df = pd.read_csv(filepath_or_buffer="SMSSpamCollection",
                 header=None, sep="\t", names=["Message Class", "Messages"])

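One thing that may be worth checking: by default read_csv treats a double quote as a quoting character, so an SMS that contains an unbalanced " can swallow the lines that follow it. The sketch below is only an illustration under stated assumptions. It uses a small in-memory sample in place of the real file (the rows are adapted from the question, and the stray quote in the first message is added here for demonstration) and passes quoting=csv.QUOTE_NONE so that quotes are read as ordinary characters.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory stand-in for the extensionless file.
# The leading quote in the first message is added for illustration only.
sample = (
    'ham\t"Ok lar... Joking wif u oni...\n'
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n"
    "ham\tU dun say so early hor... U c already then say...\n"
)

df = pd.read_csv(
    io.StringIO(sample),        # with the real data, pass the path "SMSSpamCollection"
    sep="\t",
    header=None,                # the file has no header row
    names=["Message Class", "Messages"],
    quoting=csv.QUOTE_NONE,     # treat " as a normal character, not a field delimiter
)

print(df.shape)
print(df["Message Class"].value_counts())
```

With the real file, comparing df.shape[0] against the number of lines in the file is a quick way to confirm that no rows were merged.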

Upvotes: 1
