Reputation: 619
I was looking at a past Stack Overflow post, but I am having trouble building on top of it.
I want to get the username and the text of each post into a DataFrame, something like:
col1       col2
johnsmith  I love cats
janesmith  I own 50 cats
The code I am trying to modify:
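import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html')
# Name the parser explicitly; 'lxml' needs the lxml package to be installed.
soup = BeautifulSoup(r.text, 'lxml')
# Select every element whose id starts with "post_message" and print its text.
for div in soup.select('[id^=post_message]'):
    print(div.get_text("\n", strip=True))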
Upvotes: 0
Views: 234
Reputation: 1357
I only parsed the page at the URL you included in the question.
The posts list may need some cleanup to remove newlines, tabs, and similar whitespace.
Code:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
# Usernames: links to member profile pages that contain text.
names = [name.text for name in soup.find_all('a', href=re.compile(r'^http://www\.catforum\.com/forum/members/[0-9]'), text=True)]
# Post bodies: divs whose id starts with "post_message_".
posts = [post.text for post in soup.find_all('div', id=re.compile(r'^post_message_'))]
# Pair each username with its post; zip stops at the shorter of the two lists.
df = pd.DataFrame(list(zip(names, posts)))
print(df)
Output:
   0             1
0  bluemilk      \r\n\t\t\t\r\n\t\t\t11301\n\nDid you get the c...
1  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11302\nWell, I tidied, cle...
2  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11303\nDaisy sounds like s...
3  DebS          \r\n\t\t\t\r\n\t\t\t11304\n\nNo, Kurt, I haven...
4  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11305\n\nI had a sore neck...
5  annegirl      \r\n\t\t\t\r\n\t\t\t11306\nMM- Thanks for your...
6  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11307\nWelcome back annieg...
7  spirite       \r\n\t\t\t\r\n\t\t\t11308. Hi annegirl! None o...
8  DebS          \r\n\t\t\t\r\n\t\t\t11309\n\nWelcome to you, a...
9  annegirl      \r\n\t\t\t\r\n\t\t\t11310\nDebS and Spirite th...
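The output above still carries the \r, \n and \t runs in the post text. One possible way to clean them up is sketched below. The helper name clean_post and the two sample values are assumptions made for illustration (the samples are shortened strings shaped like the output rows); in practice names and posts would be the lists built by the scraping code.

```python
import re


def clean_post(text):
    # Collapse every run of whitespace (carriage returns, newlines, tabs,
    # spaces) into a single space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample values shaped like the output above; the real lists
# would come from the scraping code.
names = ["bluemilk", "Mochas Mommy"]
posts = [
    "\r\n\t\t\t\r\n\t\t\t11301\n\nDid you get the c",
    "\r\n\t\t\t\r\n\t\t\t11302\nWell, I tidied, cle",
]

# Pair each username with its cleaned post text.
rows = [(name, clean_post(post)) for name, post in zip(names, posts)]

for name, post in rows:
    print(name, "|", post)
```

If this fits your data, passing rows to pd.DataFrame(rows, columns=['col1', 'col2']) should give the two named columns asked for in the question.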
Upvotes: 1