A_Patterson

Reputation: 555

Writing csv rows with multiple list items in columns

I am using BeautifulSoup to scrape some web data into a csv file. Some of the elements I am scraping are lists of specific items; two lists, to be exact. Below is an example of what the data comes through as:

Name, Image_Filename, [2015, 2016, 2017], [12, 55, 74]

What I need is a row for each individual item in each list, like this:

Name, Image_Filename, 2015, 12
Name, Image_Filename, 2016, 55
Name, Image_Filename, 2017, 74

I already have all the data scraped into a csv file and I am looking to avoid going through the entire sheet and manually scrubbing the data. I am not opposed to doing that, but if Python can be leveraged to complete this task, I would prefer to go that route...
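For reference, a minimal sketch of post-processing the already-written CSV, assuming the two list columns were saved as Python-style bracketed strings such as "[2015, 2016, 2017]" (which is what the script below produces) and that the file is named train_data.csv; the output file name is arbitrary:

import ast
import csv

# read the scraped file and write an expanded copy with one row per list item
with open('train_data.csv', newline='') as src, \
     open('train_data_expanded.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for name, image_filename, waves, bags in reader:
        wave_list = ast.literal_eval(waves)   # turn the "[...]" strings back into real lists
        bag_list = ast.literal_eval(bags)
        for wave, bag in zip(wave_list, bag_list):
            writer.writerow([name, image_filename, wave, bag])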

Here is the entire script I use to scrape the data. I am fairly new to Python with limited experience in web scraping / browser automation. I don't know if formatting the data could be included in this script or if it is a separate one I would have to write:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
import re
import csv

with open('hyperlinks.csv', 'r') as startFile:

    for line in startFile:
        url = urlopen(line.strip())          # strip the trailing newline from each hyperlink
        soup = BeautifulSoup(url, 'html.parser')

        data_container = soup.find('aside')
        image = data_container.find('a', attrs={'class': 'image-thumbnail'})
        image_href = image.get('href')

        img_container = data_container.find('img')
        data_image_name = img_container.get('data-image-name')
        filename = data_image_name.split('.')
        final_filename = filename[0]         # file name without its extension
        train_title = data_container.find('h2')
        title_text = train_title.get_text()

        image_filename = final_filename
        full = image_filename + '.jpg'

        # collect the series/wave labels (first list)
        series = data_container.find('div', attrs={'data-source': 'series'})
        wave_links = series.find('div')
        wave_set = []
        wave_links_sep = wave_links.find_all('a')
        for item in wave_links_sep:
            text_only = item.get_text()
            wave_set.append(text_only)

        # collect the bag codes (second list), dropping the "(year)" suffixes
        bag = data_container.find('div', attrs={'data-source': 'bag_code'})
        bag_code = bag.find('div')
        bag_text = bag_code.get_text()
        regex = re.compile(r'\s\((2015|2016|2017|2018|2019)\)')
        bag_numbers = re.sub(regex, ",", bag_text)
        bag_list = []
        for nums in bag_numbers.split(','):
            bag_list.append(nums)

        filtered_bag_list = list(filter(None, bag_list))

        with open('train_data.csv', 'a', newline='') as myFile:
            writer = csv.writer(myFile)
            writer.writerow([title_text, full, wave_set, filtered_bag_list])
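The bracketed lists in the output come from that final writerow call: csv.writer calls str() on any non-string cell, so each list is written into a single column as its literal repr. A minimal, standalone demonstration of that behaviour:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['Name', 'file.jpg', ['2015', '2016'], ['12', '55']])
print(buf.getvalue())
# Name,file.jpg,"['2015', '2016']","['12', '55']"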

Upvotes: 1

Views: 673

Answers (1)

Patrick Artner

Reputation: 51683

You can zip both of your item lists:

for wvs, bgl in zip(wave_set, filtered_bag_list):
    writer.writerow([title_text, full, wvs, bgl])

provided your lists are of the same length and correspond index-wise.

Full example:

wave_set = [2015, 2016, 2017]
filtered_bag_list = [12, 55, 74]

import csv
with open('train_data.csv', 'a', newline='') as myFile:
    writer = csv.writer(myFile)
    for wvs, bgl in zip(wave_set, filtered_bag_list):
        writer.writerow(["some", "text", wvs, bgl])

with open("train_data.csv") as f:
    print(f.read())

Output in file:

some,text,2015,12
some,text,2016,55
some,text,2017,74

zip([1, 2, 3], ["a", "b", "c"])

creates the tuples (1, "a"), (2, "b"), (3, "c") and provides them as an iterator; see, for example, Zip lists in Python for more insights.
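If the two lists can ever differ in length, itertools.zip_longest is a near drop-in alternative that pads the shorter list instead of silently dropping items (a sketch, not part of the original answer):

from itertools import zip_longest

wave_set = [2015, 2016, 2017, 2018]
filtered_bag_list = [12, 55, 74]

for wvs, bgl in zip_longest(wave_set, filtered_bag_list, fillvalue=''):
    print([wvs, bgl])
# [2015, 12]
# [2016, 55]
# [2017, 74]
# [2018, '']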

Upvotes: 1
