Lucky

Reputation: 609

How to load a txt file as data in Python?

I'm learning how to use scikit-learn (sklearn) to do some machine learning.

I was wondering how to import this as data?

This is a dataset from the million song genre dataset.

How can I make data.target[0] equal to 0 (meaning "classic pop and rock"), data.target[1] also equal to 0 ("classic pop and rock"), and data.target[640] equal to 1 (meaning "folk")?

And how can I make data.data[0,:] equal to -8.697, 155.007, 1, 9, and so forth (all the numerical values after the title column)?

Upvotes: 2

Views: 2575

Answers (1)

datawrestler

Reputation: 1567

As others have mentioned, it is a little unclear what shape you are looking for, but as a general starting point you could read the text file into Python and convert it to a pandas dataframe, which is a very flexible format. I am sure there are more compact ways of doing this, but to keep the steps clear we could start with:

import pandas as pd
import re 

file = 'filepath'  # this is the file path to the saved text file
# read every line; the with-block closes the file when done
with open(file, 'r') as music:
    lines = music.readlines()
# split the lines by comma
lines = [line.split(',') for line in lines]
# capturing the column line (assumes the column names sit on the 10th line of the file)
columns = lines[9]
# capturing the actual content of the data, and dismissing the header info
content = lines[10:]

musicdf = pd.DataFrame(content)
# assign the column names to our dataframe
musicdf.columns = columns
# preview the dataframe (print so it also shows when run as a script)
print(musicdf.head(10))

# the final column had formatting issues, so wanted to provide code to get rid of the "\n" in both the column title and the column values

def cleaner(txt):
    txt = re.sub(r'[\n]+', '', txt)
    return txt

# rename the column of issue
musicdf = musicdf.rename(columns = {'var_timbre12\n' : 'var_timbre12'})

# applying the column cleaning function above to the column of interest
musicdf['var_timbre12'] = musicdf['var_timbre12'].apply(cleaner)

# checking the top and bottom of dataframe for column var_timbre12
print(musicdf['var_timbre12'].head(10))
print(musicdf['var_timbre12'].tail(10))

The result of this would be the following:

                 %genre            track_id       artist_name
0  classic pop and rock  TRFCOOU128F427AEC0  Blue Oyster Cult
1  classic pop and rock  TRNJTPB128F427AE9F  Blue Oyster Cult
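
As one of the more compact routes, pandas can parse the file directly. This is only a minimal sketch: it assumes, like the code above, that nine header lines come before the column row. The in-memory sample stands in for the real file, and its titles, the column names 'loudness' and 'tempo', and the values in the last two rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the saved text file: nine '%' comment lines, the
# column row, then data rows. Only a few columns are shown; the names
# 'loudness' and 'tempo', the titles, and the last two rows' values are
# assumptions.
header = ["% comment line {}".format(i + 1) for i in range(9)]
rows = [
    "%genre,track_id,artist_name,title,loudness,tempo",
    "classic pop and rock,TRFCOOU128F427AEC0,Blue Oyster Cult,Title A,-8.697,155.007",
    "classic pop and rock,TRNJTPB128F427AE9F,Blue Oyster Cult,Title B,-10.5,148.2",
    "folk,TRACK_ID_3,Some Artist,Title C,-12.25,101.5",
]
sample = "\n".join(header + rows)

# skiprows=9 drops the comment lines, so the column row becomes the header.
# With a real file, pass its path instead of the StringIO object.
musicdf = pd.read_csv(io.StringIO(sample), skiprows=9)

# Numeric columns are parsed as numbers, and no trailing "\n" is left to clean.
print(musicdf.dtypes)
print(musicdf.head(10))
```

Read this way, the numeric columns arrive as numbers rather than strings, so the newline cleanup step is not needed.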

With the data in this format, you can do many grouping tasks, such as finding certain genres and their relative attributes, using the pandas groupby function.
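
To get from the dataframe to the data.target and data.data arrays asked about in the question, one possible approach is to integer-encode the genre column and convert the numeric columns to floats. This is a sketch under assumptions: the small dataframe stands in for musicdf, the column names after 'title' are guesses, and the titles and the values in rows two and three are made up. The names target, target_names and data are also just illustrative.

```python
import pandas as pd

# Made-up stand-in for musicdf as built by splitting lines on commas:
# every value is a string. Column names after 'title' are assumptions.
musicdf = pd.DataFrame(
    [
        ["classic pop and rock", "TRFCOOU128F427AEC0", "Blue Oyster Cult",
         "Title A", "-8.697", "155.007", "1", "9"],
        ["classic pop and rock", "TRNJTPB128F427AE9F", "Blue Oyster Cult",
         "Title B", "-10.5", "148.2", "1", "9"],
        ["folk", "TRACK_ID_3", "Some Artist",
         "Title C", "-12.25", "101.5", "4", "2"],
    ],
    columns=["%genre", "track_id", "artist_name", "title",
             "loudness", "tempo", "time_signature", "key"],
)

# Integer-encode the genres in order of first appearance, so the first
# genre in the file becomes 0, the next new genre becomes 1, and so on.
target, target_names = pd.factorize(musicdf["%genre"])

# Every column after 'title' is treated as a numeric feature.
first_feature = musicdf.columns.get_loc("title") + 1
data = musicdf.iloc[:, first_feature:].astype(float).to_numpy()

print(target)                   # integer label for each row
print(target_names[target[2]])  # genre name behind the third row's label
print(data[0, :])               # numeric features of the first row
```

The data and target arrays can then be passed to a scikit-learn estimator's fit method, and target_names maps each integer back to its genre.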

Hope this helps!

Upvotes: 2
