OptimusPrime

Reputation: 619

Web scraping a forum post in Python using Beautiful Soup and lxml, saving results to a pandas DataFrame

I was looking at a past Stack Overflow post, but I am having trouble building on top of it.

I want to:

  1. get the users who posted in the forum thread
  2. get the content of each forum post
  3. save it all to a DataFrame

Something like this DataFrame:

col1       col2   
johnsmith  I love cats
janesmith  I own 50 cats

The code I am trying to modify:

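import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html')

# Name the parser explicitly; 'lxml' requires the lxml package to be installed.
soup = BeautifulSoup(r.text, 'lxml')

# Select every element whose id starts with "post_message" and print its text.
for div in soup.select('[id^=post_message]'):
    print(div.get_text("\n", strip=True))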

Upvotes: 0

Views: 234

Answers (1)

Ali

Reputation: 1357

I only parsed the page at the URL you included in the question.

The posts list may need some clean-up, such as removing newlines, tabs, and other extra whitespace.
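One way to do that clean-up, sketched here under the assumption that each post is a raw string padded with tabs and line breaks, is to collapse every run of whitespace into a single space. The helper name `clean_post` and the sample strings are hypothetical, not part of the scraped page.

```python
def clean_post(text):
    """Collapse runs of spaces, tabs, and line breaks into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Hypothetical raw strings shaped like scraped post bodies.
raw_posts = ["\r\n\t\t\t\r\n\t\t\tI love cats\n\n", "\r\n\t\t\tI own\n50 cats"]
cleaned = [clean_post(p) for p in raw_posts]
print(cleaned)
```

Note that this also flattens paragraph breaks inside a post, so it may not suit cases where the original line structure matters.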

Code:

import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html', headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')
# User names: links to member profile pages that contain text.
names = [name.text for name in soup.find_all('a', href=re.compile(r'^http://www\.catforum\.com/forum/members/[0-9]'), string=True)]
# Post bodies: divs whose id starts with "post_message_".
posts = [post.text for post in soup.find_all('div', id=re.compile(r'^post_message_'))]
# Pair each name with its post; zip() stops at the shorter list.
df = pd.DataFrame(list(zip(names, posts)))

print(df)

Output:

              0                                                  1
0      bluemilk  \r\n\t\t\t\r\n\t\t\t11301\n\nDid you get the c...
1  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11302\nWell, I tidied, cle...
2  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11303\nDaisy sounds like s...
3          DebS  \r\n\t\t\t\r\n\t\t\t11304\n\nNo, Kurt, I haven...
4  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11305\n\nI had a sore neck...
5      annegirl  \r\n\t\t\t\r\n\t\t\t11306\nMM- Thanks for your...
6  Mochas Mommy  \r\n\t\t\t\r\n\t\t\t11307\nWelcome back annieg...
7       spirite  \r\n\t\t\t\r\n\t\t\t11308. Hi annegirl! None o...
8          DebS  \r\n\t\t\t\r\n\t\t\t11309\n\nWelcome to you, a...
9      annegirl  \r\n\t\t\t\r\n\t\t\t11310\nDebS and Spirite th...
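The default column labels 0 and 1 in this output can be replaced with descriptive names by building the frame from a dict. Below is a minimal sketch; the column names `user` and `post` are hypothetical, and the two sample lists are made-up stand-ins for `names` and `posts`. It also checks the list lengths first, since `zip()` silently drops the extra items from the longer list.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped `names` and `posts` lists.
names = ["johnsmith", "janesmith"]
posts = ["I love cats", "I own 50 cats"]

# Fail loudly if the two lists got out of step instead of pairing them wrongly.
if len(names) != len(posts):
    raise ValueError(f"{len(names)} names but {len(posts)} posts")

# Dict keys become the column labels.
df = pd.DataFrame({"user": names, "post": posts})
print(df)
```

If the counts differ on a real page, the name and post selectors are probably matching different sets of elements and need to be narrowed.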

Upvotes: 1
