Valerie Sharp

Reputation: 49

Python web scraper using BeautifulSoup 4

I want to build a database of commonly used words. The script below runs, but my biggest issue is that I need all of the words in a single column, and what I did feels more like a hack than a real fix. Using BeautifulSoup, is there a way to output everything in one column without extra blank lines?

import requests
import re
from bs4 import BeautifulSoup

# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")

# Creating the CSV file (text mode, because str values are written to it)
commonFile = open('common_words.csv', 'w')

# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ID + '\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()

Upvotes: 0

Views: 179

Answers (1)

Haseeb Ahmad

Reputation: 574

How about this?

import requests
import csv
from bs4 import BeautifulSoup

# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")

# Every table row inside the file view
words = soup.select('div[class=file] tr')

# newline="" keeps the csv module from adding blank rows on Windows
with open("common_words.csv", "w", newline="") as csv_file:
    f = csv.writer(csv_file)
    f.writerow(["common_words"])
    for row in words:
        # Strip the newlines so each word sits alone in its row
        f.writerow([row.text.replace('\n', '')])
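If some rows come back as nothing but whitespace, it may help to filter them out before writing. Here is a rough sketch of that idea using only the standard library. The names clean_words and write_single_column are made up for illustration, and the sample list is a hypothetical stand-in for the text of each scraped row, so adjust it to whatever the page actually returns.

```python
import csv
import io


def clean_words(cell_texts):
    """Strip surrounding whitespace from each string and drop empty ones."""
    cleaned = []
    for text in cell_texts:
        word = text.strip()
        if word:  # skip rows that held only whitespace
            cleaned.append(word)
    return cleaned


def write_single_column(words, handle, header="common_words"):
    """Write a header row, then one word per row, to an open text handle."""
    # lineterminator="\n" is a choice made for this sketch; for a real file,
    # open it with newline="" so the csv module controls the line endings.
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([header])
    for word in words:
        writer.writerow([word])


# Hypothetical sample standing in for the text of each scraped <tr>
sample = ["\n\nthe\n", "\n\nof\n", "\n   \n", "\n\nand\n"]
buffer = io.StringIO()
write_single_column(clean_words(sample), buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a handle from open("common_words.csv", "w", newline="") in place of the StringIO object.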

Upvotes: 1
