Reputation: 2031
This is the code I have. Because of the content of the raw data being parsed, the 'user list' and the 'tweet list' end up with different lengths. When I write the lists as columns in a data frame, I get ValueError: arrays must all be same length. I understand why, but I have been looking for a way to work around it by putting 0 or NaN in the right places of the shorter array. Any ideas?
import pandas
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('#raw.html'))
chunk = soup.find_all('div', class_='content')
userlist = []
tweetlist = []

for tweet in chunk:
    username = tweet.find_all(class_='username js-action-profile-name')
    for user in username:
        user2 = user.get_text()
        userlist.append(user2)

for text in chunk:
    tweets = text.find_all(class_='js-tweet-text tweet-text')
    for tweet in tweets:
        tweet2 = tweet.get_text().encode('utf-8')
        tweetlist.append('|'+tweet2)

print len(tweetlist)
print len(userlist)

#MAKE A DATAFRAME WITH THIS
data = {'tweet' : tweetlist, 'user' : userlist}
frame = pandas.DataFrame(data)
print frame

# Export dataframe to csv
frame.to_csv('#parsed.csv', index=False)
Upvotes: 7
Views: 14861
Reputation: 1
You can easily solve this by building the data frame like this (here Sl is the dictionary of lists and pd is pandas):
dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in Sl.items() })
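For anyone wondering why this works, here is a minimal sketch under Python 3. The contents of Sl are made up for illustration; only the one-liner itself comes from the answer. Wrapping each list in a Series gives it its own integer index, and the DataFrame constructor aligns the columns on that index, so the shorter column is padded with NaN at the end.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of unequal-length lists (Sl above).
Sl = {"tweet": ["first", "second", "third"], "user": ["@a", "@b"]}

# Each Series carries an index 0, 1, 2, ...; the constructor aligns on it
# and fills the positions the shorter Series lacks with NaN.
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in Sl.items()})

print(dict_df)
```

Note that the padding always lands at the end of the shorter column. It cannot tell which tweet is actually missing its user.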
Upvotes: 0
Reputation: 1248
Try this (d is the dictionary holding your lists):
frame = pandas.DataFrame.from_dict(d, orient='index')
After that, you should transpose your frame with:
frame = frame.transpose()
Then you can export to csv:
frame.to_csv('#parsed.csv', index=False)
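Put together, the three steps might look like the sketch below. The sample lists are invented for illustration, and I return the CSV as a string instead of writing '#parsed.csv' so the snippet leaves no file behind. The na_rep argument of to_csv is optional. By default missing values are written as empty fields, and I am assuming here that you want a literal NaN in the file.

```python
import pandas

# Hypothetical sample data standing in for tweetlist and userlist.
d = {"tweet": ["first", "second", "third"], "user": ["@a", "@b"]}

# With orient='index' each key becomes a row, so rows of different
# lengths are allowed and the short ones are padded with missing values.
frame = pandas.DataFrame.from_dict(d, orient="index")

# Transpose so the keys become columns again.
frame = frame.transpose()

# Without a path, to_csv returns the CSV text; na_rep sets how
# missing values are written.
csv_text = frame.to_csv(index=False, na_rep="NaN")
print(csv_text)
```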
Upvotes: 3
Reputation: 366
I'm not sure that this is exactly what you want, but anyway:
d = dict(tweets=tweetlist, users=userlist)
pandas.DataFrame({k : pandas.Series(v) for k, v in d.iteritems()})
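On Python 3, dict.iteritems() no longer exists, so the same idea would use items() instead. A rough sketch with invented sample lists follows. The fillna call is my addition for the "0 instead of NaN" part of the question, and I use the string "0" on the assumption that both columns hold text.

```python
import pandas

# Hypothetical sample data; in the question these come from the parsed HTML.
tweetlist = ["first", "second", "third"]
userlist = ["@a", "@b"]

d = dict(tweets=tweetlist, users=userlist)

# Python 3 spelling of the line above: items() replaces iteritems().
frame = pandas.DataFrame({k: pandas.Series(v) for k, v in d.items()})

# Optional: replace the NaN padding with a placeholder of your choice.
filled = frame.fillna("0")

print(filled)
```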
Upvotes: 13