Reputation: 555
I am using BeautifulSoup to scrape some web data into a csv file. Some of the elements that I am scraping are lists of specific items; two lists, to be exact. Below is an example of how the data comes through:
Name, Image_Filename, [2015, 2016, 2017], [12, 55, 74]
What I need is a row for each individual item in each list like this:
Name, Image_Filename, 2015, 12
Name, Image_Filename, 2016, 55
Name, Image_Filename, 2017, 74
I already have all the data scraped into a csv file and I am looking to avoid going through the entire sheet and manually scrubbing the data. I am not opposed to doing this but if Python can be leveraged to complete this task, I would prefer to go that route...
Here is the entire script I use to scrape the data. I am fairly new to Python, with limited experience in web scraping / browser automation. I don't know whether formatting the data could be included in this script or whether I would have to write a separate one:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
import re
import csv

with open('hyperlinks.csv', 'r') as startFile:
    for line in startFile:
        url = urlopen(line)
        soup = BeautifulSoup(url, 'html.parser')
        data_container = soup.find('aside')
        image = data_container.find('a', attrs={'class': 'image-thumbnail'})
        image_href = image.get('href')
        img_container = data_container.find('img')
        data_image_name = img_container.get('data-image-name')
        filename = data_image_name.split('.')
        final_filename = filename[0]
        train_title = data_container.find('h2')
        title_text = train_title.get_text()
        image_filename = final_filename
        full = image_filename + '.jpg'
        series = data_container.find('div', attrs={'data-source': 'series'})
        wave_links = series.find('div')
        wave_set = []
        wave_links_sep = wave_links.find_all('a')
        for item in wave_links_sep:
            text_only = item.get_text()
            wave_set.append(text_only)
        bag = data_container.find('div', attrs={'data-source': 'bag_code'})
        bag_code = bag.find('div')
        bag_text = bag_code.get_text()
        regex = re.compile(r'\s\((2015|2016|2017|2018|2019)\)')
        bag_numbers = re.sub(regex, ",", bag_text)
        bag_list = []
        for nums in bag_numbers.split(','):
            bag_list.append(nums)
        filtered_bag_list = list(filter(None, bag_list))
        with open('train_data.csv', 'a', newline='') as myFile:
            writer = csv.writer(myFile)
            writer.writerow([title_text, full, wave_set, filtered_bag_list])
Upvotes: 1
Views: 673
Reputation: 51683
You can zip both of your lists and write one row per pair:
for wvs, bgl in zip(wave_set, filtered_bag_list):
    writer.writerow([title_text, full, wvs, bgl])
This works if your lists are the same length and correspond index-wise.
Full example:
import csv

wave_set = [2015, 2016, 2017]
filtered_bag_list = [12, 55, 74]

with open('train_data.csv', 'a', newline='') as myFile:
    writer = csv.writer(myFile)
    for wvs, bgl in zip(wave_set, filtered_bag_list):
        writer.writerow(["some", "text", wvs, bgl])

with open("train_data.csv") as f:
    print(f.read())
Output in file:
some,text,2015,12
some,text,2016,55
some,text,2017,74
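Since the question mentions that the data is already sitting in a csv file, here is a minimal sketch of how that file might be flattened after the fact instead of re-scraping. It assumes each row has exactly four columns and that the two list cells hold the text csv.writer produced when it stringified the lists (for example "['2015', '2016', '2017']"). The names explode_rows, flatten_csv and train_data_flat.csv are hypothetical and only used for illustration.

```python
import ast
import csv


def explode_rows(rows):
    """Yield one output row per paired item from the two list columns."""
    for row in rows:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        title, filename, waves_cell, bags_cell = row
        # Assumption: each list cell is the repr of a Python list,
        # so ast.literal_eval can turn it back into a real list.
        waves = ast.literal_eval(waves_cell)
        bags = ast.literal_eval(bags_cell)
        for wave, bag in zip(waves, bags):
            yield [title, filename, wave, bag]


def flatten_csv(src_path, dst_path):
    """Read src_path and write the exploded rows to dst_path."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(explode_rows(csv.reader(src)))
```

Calling flatten_csv("train_data.csv", "train_data_flat.csv") would then write the one-row-per-item version to a new file and leave the original untouched. ast.literal_eval is used rather than eval because it only accepts Python literals.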
zip([1, 2, 3], ["a", "b", "c"])
creates the tuples (1, "a"), (2, "b"), (3, "c")
and provides them as an iterator. See, for example, "Zip lists in Python" for more insight.
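One thing to watch: zip stops at the shortest input, so if the two lists ever differ in length, the extra items are silently dropped. A small sketch of the difference, using made-up data where the second list is one item short, and itertools.zip_longest as one possible way to keep every item:

```python
from itertools import zip_longest

wave_set = ["2015", "2016", "2017"]
filtered_bag_list = ["12", "55"]  # one item short, for illustration only

# zip stops at the shortest list, so the "2017" entry is dropped
short_rows = list(zip(wave_set, filtered_bag_list))

# zip_longest keeps every item and pads the gap with fillvalue
padded_rows = list(zip_longest(wave_set, filtered_bag_list, fillvalue=""))

print(short_rows)   # [('2015', '12'), ('2016', '55')]
print(padded_rows)  # [('2015', '12'), ('2016', '55'), ('2017', '')]
```

Whether an empty cell is the right filler depends on your data; it may be better to check the lengths first and investigate any mismatch.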
Upvotes: 1