Reputation: 1657
To be honest, I am not sure how best to ask this question, since the error could be in several places, so I'll write out every step (thanks for being patient with a beginner).
I am trying to use the Last.fm dataset: https://grouplens.org/datasets/hetrec-2011/
They provide a Python script that helps read the data from this dataset.
What I did first was parse the lines of the data file with the given iter_lines function:
### first, open file into a file handle object
file = os.path.join(baseDir, 'artists.dat')
file_opener = open(file, "r")
lines = iter_lines(file_opener)
where the given iter_lines() function looks like this:
def iter_lines(open_file):
    reader = csv.reader(
        open_file,
        delimiter='\t',
    )
    next(reader)  # Skip the header
    return reader
Then I tried to use their given parse_artist_line() function to read artists.dat:
artists_df = pd.DataFrame(['key','value'])
for line in lines:
    ### so the parse_artist_line() will return a dictionary
    artist_dict = parse_artist_line(line)
    artist_list = artist_dict.items()
    ### try to put in a temporary dataframe
    temp = pd.DataFrame.from_dict(artist_dict, orient='index')
    ### finally append the temporary df to the artists_df
    artists_df.append(temp, ignore_index=True)
print(artists_df.head(5))
When I print artists_df with the last statement, I only get this output:
       0
0    key
1  value
Their parse_artist_line() looks like this:
def parse_artist_line(line):
    (artist_id, name, _, _) = line
    current_artist = deepcopy(ARTISTS)
    current_artist["artist_id"] = int(artist_id)
    current_artist["name"] = name
    return current_artist
By the way, if you print temp, it looks like this:
                     0
artist_id        18743
name       Coptic Rain
If I pass "columns" as the "orient" argument to from_dict() instead, I get an error:
ValueError: If using all scalar values, you must pass an index
I've followed the following posts/info pages:
I'm no longer sure what I'm doing wrong (probably every step). Any help or guidance is appreciated!
Upvotes: 1
Views: 87
Reputation: 862671
I believe it is not necessary here to convert the file to a dict and then to a DataFrame. It is simpler to use read_csv and, if you need to select columns by name, add the usecols parameter:
artists_df = pd.read_csv('artists.dat', sep='\t', usecols=['id','name'])
print (artists_df.head())
   id               name
0   1       MALICE MIZER
1   2    Diary of Dreams
2   3  Carpathian Forest
3   4       Moi dix Mois
4   5        Bella Morte
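For anyone who wants to try this without downloading the dataset, here is a minimal sketch that reads a small in-memory sample instead of the real file. The `sample` string is a hypothetical stand-in that only mimics the tab-separated layout and header names shown in the output above; the `rename` step is optional and only there to match the `artist_id` key used by parse_artist_line in the question.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the layout of artists.dat
# (tab-separated, header row: id, name, url, pictureURL).
sample = (
    "id\tname\turl\tpictureURL\n"
    "1\tMALICE MIZER\thttp://www.last.fm/music/MALICE+MIZER\t"
    "http://userserve-ak.last.fm/serve/252/10808.jpg\n"
    "2\tDiary of Dreams\thttp://www.last.fm/music/Diary+of+Dreams\t"
    "http://userserve-ak.last.fm/serve/252/3052066.jpg\n"
)

# sep='\t' tells read_csv the file is tab-separated;
# usecols keeps only the listed columns.
artists_df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=["id", "name"])

# Optional: rename 'id' to the key name used in the question's parser.
artists_df = artists_df.rename(columns={"id": "artist_id"})

print(artists_df)
```

With the real file, replace `io.StringIO(sample)` with the path to artists.dat.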
If you want to read all columns:
artists_df = pd.read_csv('artists.dat', sep='\t')
print (artists_df.head())
   id               name                                         url  \
0   1       MALICE MIZER       http://www.last.fm/music/MALICE+MIZER
1   2    Diary of Dreams    http://www.last.fm/music/Diary+of+Dreams
2   3  Carpathian Forest  http://www.last.fm/music/Carpathian+Forest
3   4       Moi dix Mois       http://www.last.fm/music/Moi+dix+Mois
4   5        Bella Morte        http://www.last.fm/music/Bella+Morte

                                          pictureURL
0    http://userserve-ak.last.fm/serve/252/10808.jpg
1  http://userserve-ak.last.fm/serve/252/3052066.jpg
2  http://userserve-ak.last.fm/serve/252/40222717...
3  http://userserve-ak.last.fm/serve/252/54697835...
4  http://userserve-ak.last.fm/serve/252/14789013...
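If you do want to keep the dataset's helper functions, note that the loop in the question leaves artists_df unchanged because DataFrame.append returned a new DataFrame rather than modifying the original, and the result was discarded (the method was also removed in pandas 2.0). A minimal sketch of one alternative is to collect the parsed dicts in a list and build the DataFrame once. The `sample` string is hypothetical, and `parse_artist_line` here is a simplified stand-in: the real helper copies an ARTISTS template that is not shown in the question, so I assume it only has the `artist_id` and `name` keys.

```python
import csv
import io

import pandas as pd

# Hypothetical sample with the same four tab-separated columns as artists.dat.
sample = (
    "id\tname\turl\tpictureURL\n"
    "1\tMALICE MIZER\turl-1\tpicture-1\n"
    "2\tDiary of Dreams\turl-2\tpicture-2\n"
)


def iter_lines(open_file):
    reader = csv.reader(open_file, delimiter="\t")
    next(reader)  # Skip the header
    return reader


def parse_artist_line(line):
    # Simplified stand-in (assumption): the real helper starts from a
    # deepcopy of an ARTISTS template that the question does not show.
    artist_id, name, _, _ = line
    return {"artist_id": int(artist_id), "name": name}


# One dict per row; a list of dicts becomes one DataFrame row per dict.
records = [parse_artist_line(line) for line in iter_lines(io.StringIO(sample))]
artists_df = pd.DataFrame(records)

print(artists_df.head())
```

Passing a list of dicts also avoids the "If using all scalar values, you must pass an index" error, since each dict is treated as a row rather than as a set of columns.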
Upvotes: 1