Simon Breton

Reputation: 2876

Why is only part of my list written to a CSV file?

I'm new to Python and trying to build my first script. I want to scrape a list of URLs and export the results to a CSV file.

My script runs without errors, but when I open the CSV file only a few lines of data have been written. When I print the lists I'm trying to write (sharelist and sharelist1), the output is complete, whereas the CSV file is not.

Here is part of my code:

for url in urllist[10:1000]:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')

    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() removes leading and trailing whitespace

    # save the data in two parallel lists
    sharelist.append(url)
    sharelist1.append(share)

    # open a file for writing.
    csv_out = open('mycsv.csv', 'wb')

    # create the csv writer object.
    mywriter = csv.writer(csv_out)

    # writerow - one row of data at a time.
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)

    # always make sure that you close the file.
    # otherwise you might find that it is empty.
    csv_out.close()

I'm not sure which part of my code I should share here. Please tell me if this isn't enough!

Upvotes: 0

Views: 116

Answers (3)

Teach3r

Reputation: 13

Open the file for writing using a context manager; that way, you don't need to close the file explicitly.

with open('mycsv.csv', 'wb') as file_obj:  # 'wb' for the csv module on Python 2
    mywriter = csv.writer(file_obj)
    for url in urllist[10:1000]:
        try:
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
            raise  # re-raise other HTTP errors instead of continuing with no page
        soup = BeautifulSoup(page, 'html.parser')

        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()
        # no need to append to two lists and zip them up afterwards;
        # by the looks of it, the inner zip loop would create duplicate
        # rows in your file, so write each row directly instead
        mywriter.writerow((url, share))
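To see the duplicate-row effect mentioned in the comment, here is a minimal sketch with made-up data (the URLs and values are hypothetical): writing zip(sharelist, sharelist1) inside the loop re-emits every accumulated pair on each pass.

sharelist, sharelist1, rows = [], [], []
for url, share in [('a.com', '1'), ('b.com', '2'), ('c.com', '3')]:
    sharelist.append(url)
    sharelist1.append(share)
    # re-writing the whole zip on every iteration repeats earlier pairs
    for row in zip(sharelist, sharelist1):
        rows.append(row)
print(rows)
# [('a.com', '1'),
#  ('a.com', '1'), ('b.com', '2'),
#  ('a.com', '1'), ('b.com', '2'), ('c.com', '3')]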

Upvotes: 0

syedelec

Reputation: 1320

The problem has been found, and the best solution for files is to use the with keyword, which closes the file automatically at the end:

with open('mycsv.csv', 'wb') as csv_out:
    mywriter = csv.writer(csv_out)
    for url in urllist[10:1000]:
        try:
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:
                continue
            raise  # re-raise other HTTP errors instead of continuing with no page
        soup = BeautifulSoup(page, 'html.parser')

        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()

        # save the data in two parallel lists
        sharelist.append(url)
        sharelist1.append(share)

    # write all the collected rows once, after the loop is done
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)
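For what it's worth, the with block also closes the file if the loop dies part-way through. A minimal sketch (the RuntimeError is just a stand-in for any failure mid-loop):

try:
    with open('mycsv.csv', 'wb') as csv_out:
        csv_out.write(b'partial data')
        raise RuntimeError('simulated crash mid-write')
except RuntimeError:
    pass
print(csv_out.closed)  # True -- the file was closed despite the exception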

Upvotes: 1

Dr Xorile

Reputation: 994

The problem is that you are opening the file every time you go through the loop. Opening it in write mode truncates it, so each pass overwrites whatever was written before.

    # open a file for writing.
    csv_out = open('mycsv.csv', 'wb')

    # create the csv writer object.
    mywriter = csv.writer(csv_out)

    # writerow - one row of data at a time.
    for row in zip(sharelist, sharelist1):
        mywriter.writerow(row)

    # always make sure that you close the file.
    # otherwise you might find that it is empty.
    csv_out.close()
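To see why, here's a minimal sketch (demo.csv is a hypothetical file): reopening in 'wb' truncates on every pass, so only the last iteration's data survives.

for i in range(3):
    f = open('demo.csv', 'wb')  # 'wb' truncates the file each time it is opened
    f.write(b'row %d\n' % i)
    f.close()
# demo.csv now contains only "row 2" -- the earlier rows were wiped out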

Either open the file before the loop, or open it with the append option.

This is option one (note the indentation):

# open a file for writing.
csv_out = open('mycsv.csv', 'wb')

# create the csv writer object.
mywriter = csv.writer(csv_out)
for url in urllist[10:1000]:
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')

    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()

    # save the data in two parallel lists
    sharelist.append(url)
    sharelist1.append(share)

# writerow - one row of data at a time, once the loop is done.
for row in zip(sharelist, sharelist1):
    mywriter.writerow(row)

# always make sure that you close the file.
# otherwise you might find that it is empty.
csv_out.close()

This is option two:

for url in urllist[10:1000]:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        if e.getcode() == 404:  # check the return code
            continue
    soup = BeautifulSoup(page, 'html.parser')

    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() removes leading and trailing whitespace

    # save the data in two parallel lists
    sharelist.append(url)
    sharelist1.append(share)

    # open the file in append mode so earlier rows are kept.
    csv_out = open('mycsv.csv', 'ab')

    # create the csv writer object.
    mywriter = csv.writer(csv_out)

    # writerow - only the row for this url; re-writing the whole
    # zip here would append duplicates on every pass.
    mywriter.writerow((url, share))

    # always make sure that you close the file.
    # otherwise you might find that it is empty.
    csv_out.close()
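A quick sketch of why append mode works across re-openings (demo.csv is again hypothetical): 'ab' keeps existing contents and adds to the end instead of truncating.

for i in range(3):
    f = open('demo.csv', 'ab')  # 'ab' preserves what is already in the file
    f.write(b'row %d\n' % i)
    f.close()
# demo.csv now holds all three rows, one added per re-open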

Upvotes: 3
