SCH
SCH

Reputation: 83

Pandas for Excel and selenium loop

I am trying to print out values from excel, and values are in numbers. I goal is to read these values and search in google one by one. Will stop for x seconds when the value is 'nan', then skip this 'nan' and then keep moving on to next.

Problems faced:

  1. It is printing out in scientific notation format
  2. Want to stop doing something when its 'nan' in excel
  3. Copy UPC[i] into google search, but i wanted to only copy once, due to i want to design it open new tab then copy the second UPC[i]

My solution:

  1. I have 'lambda x: '%0.2f' % x' inside set_option to make it print out xxxxxx.00 with 2 decimal. Even i want it in int, but its already better than scientific notation format
  2. Used 'if' to see if value in upc[i] equal to 'nan' <--nan is what i got from print. But it still print out range of 20 values with 'nan'.
  3. I can't think of something now

Code:

import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import msvcrt
import datetime
import time

driver = webdriver.Chrome()
#Settings
pd.set_option('display.width',10, 'display.max_rows', 10, 'display.max_colwidth',100, 'display.width',10, 'display.float_format', lambda x: '%0.2f' % x)

df = pd.read_excel(r"BARCODE.xlsx", skiprows = 2, sheet_name = 'products id')

#Unnamed: 1 is also an empty column, i just didn't input UPC as title in excel.
upc = df['Unnamed: 1']

#I can't print out as interger...It will always have a xxxxx.0
print((upc[0:20]))

count = len(upc)
i = 0
for i in range(count ):
    if upc[i] == 'nan':
        'skip for x seconds and continue, i am not sure how to do yet'
    else:
        print(int(upc[i]))
        driver.get('https://www.google.com')
        driver.find_element_by_name('q').send_keys(int(upc[i]))
        i = i + 1

Print out:

3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan          <- i want program to stop here so i can do something else.
3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan
Name: Unnamed: 1, Length: 20, dtype: float64
3337872411991
3433422408159
3337875598071
3337872412516
3337875518451
etc....

Googled some format about number, such as set printing format, but I got confused between .format and lambda.

Upvotes: 0

Views: 250

Answers (2)

ag6025
ag6025

Reputation: 51

It is printing out in scientific notation format

It seems you have numbers like UPC and EANs. You can probably solve that by marking numbers as text instead. If you need to have always length 13 you can correct it with appending zeroes at start.

Want to stop doing something when its nan in excel

Simplest solution could be to use input and accept any character to continue executing your code. But if you want to have few seconds time.sleep() is good as well

Copy UPC[i] into google search, but i wanted to only copy once, due to i want to design it open new tab then copy the second UPC[i]

Some points you may want to reconsider:

  • Iterating in python can be done with enumerate() if you need index values. If you do not need index you may simply drop it instead. for value in data_frame['UPC']:
  • With selenium you can directly scrape results instead of using new tabs.

Below you can check out working example (at least on my machine with python3, w10 and chrome exe driver).

import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

# Settings
pd.set_option('display.width', 10, 'display.max_rows', 10, 'display.max_colwidth', 100, 'display.width', 10,
              'display.float_format', lambda x: '%0.2f' % x)

data_frame = pd.read_excel('test.xlsx', sheet_name='products id', skip_blank_lines=False)

# I have chrome driver in exe, so this is how I need to inject it to get driver out
driver = webdriver.Chrome('chromedriver.exe')
google = 'https://www.google.com'

for index, value in enumerate(data_frame['UPC']):  # named the column in excel file
    if pd.isna(value):
        print('{}: zzz'.format(index))
        sleep(2)  # will sleep for 2 seconds, use input() if you want to wait indefinitely instead
    else:
        print('{}: {} {}'.format(index, value, type(value)))
        # since given values are float, you can convert it to int
        value = int(value)
        driver.get(google)
        google_search = driver.find_element_by_name('q')
        google_search.send_keys(value)
        google_search.send_keys('\uE007')  # this is "ENTER" for committing your search in google or Keys.ENTER

        sleep(0.5)
        # you may want to wait a bit before page loads fully, then scrape info you want
        # also consider using try-except blocks if something unexpected happens

        # if you want to open new tab (windows + chrome driver)
        # open a link in a new window - workaround
        helping_link = driver.find_element_by_link_text('Help')
        actions = ActionChains(driver)
        actions.key_down(Keys.CONTROL).click(helping_link).key_up(Keys.CONTROL).perform()
        driver.switch_to.window(driver.window_handles[-1])

# close your instance of chrome driver or leave it if you need your tabs
# driver.close()

Upvotes: 1

Bastien Harkins
Bastien Harkins

Reputation: 305

  1. check this post
if upc[i].isnull():
    time.sleep(3)
  1. check out this post, which boils down to:

    driver.execute_script("window.open('https://www.google.com');") driver.switch_to.window(driver.window_handles[-1])

Upvotes: 1

Related Questions