Reputation: 237
I'm parsing some HTML data using Pandas like this:
rankings = pd.read_html('https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom')
university_guide = rankings[0]
This gives me a nice data frame:
What I want is to reshape this data frame so that there are only two columns (rank and university name). My current solution is to do something like this:
ug_copy = rankings[0][1:]
npa1 = ug_copy.as_matrix( columns=[0,1] )
npa2 = ug_copy.as_matrix( columns=[2,3] )
npa3 = ug_copy.as_matrix( columns=[4,5] )
npam = np.append(npa1,npa2)
npam = np.append(npam,npa3)
reshaped = npam.reshape((npam.size/2,2))
pd.DataFrame(data=reshaped)
This gives me what I want, but it doesn't seem like it could possibly be the best solution. I can't seem to find a good way to complete this all using a data frame. I've tried using stack/unstack and pivoting the data frame (as some of the other solutions here have suggested), but I haven't had any luck. I've tried doing something like this:
ug_copy.columns=['Rank','University','Rank','University','Rank','University']
ug_copy = ug_copy[1:]
ug_copy.groupby(['Rank', 'University'])
There has to be something small I'm missing!
Upvotes: 0
Views: 10134
Reputation: 14738
This is probably a bit shorter (also note that you can use the header
option in read_html
to save a bit of work):
import pandas as pd
rankings = pd.read_html('https://en.wikipedia.org/wiki/Rankings_of_universities_in_the_United_Kingdom', header=0)
university_guide = rankings[0]
df = pd.DataFrame(university_guide.values.reshape((30, 2)), columns=['Rank', 'University'])
df = df.sort('Rank').reset_index(drop=True)
print df
Upvotes: 7