Reputation: 87
I am not much of a programmer; I am just learning. I want to extract (public) electoral data from my country's electoral authority using Python. This is for academic purposes, but I also want to develop my programming skills. All of the data I store will be posted publicly, of course.
I need to know which Python modules let me open web pages and parse their HTML to pick out the data I need to collect. I am just hoping for some guidelines on how to do this, or any additional suggestions anyone has.
I wish to extract the votes for each party, plus additional data, fully disaggregated: State/Municipality/County/Center/Table. Finally, I hope to store it in a csv or xlsx file (I guess I'd use openpyxl or xlsxwriter).
My idea is to make a program that:
1) Takes the link input (e.g.);
2) It identifies the links for every State on the left of the HTML (Amazonas, Anzoategui, and so on);
3) Loops through each state and finds its URL (it's HTML, so I guess it'll search for and extract the <a> tags, right?);
4) Repeats with municipalities;
5) Repeats with "Parroquia" (county);
6) Repeats for every voting center;
7) Finally, repeats for every voting table in each center (1, 2, 3... whatever);
8) Next, it stores the result for every party (e.g. manually I'd click the name of every candidate, recognize the logo of the party and store its votes (30 in the example)). It should also store the data from the "technical table" at the end.
The final result should be to store all the data: State, Municipality, County, Center, Table, and the result for each party.
Upvotes: 1
Views: 3474
Reputation: 36
The following will help:
from selenium import webdriver - For setting up a new webdriver to go to websites. (The one for Chrome works quite well)
from selenium.webdriver.common.by import By - For selecting html elements by css selector, tag name, id, etc.
from selenium.webdriver.support.ui import WebDriverWait - For waiting up to a maximum time for a page or element to load
from selenium.webdriver.support import expected_conditions as EC - To set up expected conditions under which to take action when waiting for a url to load. For example, a condition could be waiting until all <a> tags have been loaded.
from selenium.webdriver.common.keys import Keys - For simulating keypresses or sending text to an HTML element
from bs4 import BeautifulSoup - For parsing a downloaded HTML document (BeautifulSoup 4 is imported from the bs4 package)
import re - To enable the use of regular expressions
import xlwt - For writing to Microsoft Excel workbooks
from xlutils.copy import copy - For creating copies of Microsoft Excel workbooks
import time - For setting up pausing times while Python code is executing
import xlrd - For reading from Microsoft Excel workbooks
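For pages that are plain HTML and do not need a browser, the "find every <a> tag and follow it" step from the question can be sketched with the standard library alone. This is only an illustration: html.parser stands in here for BeautifulSoup so that nothing has to be installed, the names LinkCollector and extract_links are made up for this example, and the sample markup and example.org URL are placeholders rather than the real site's.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # URL of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (text, url) pairs found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Placeholder fragment; the real page's markup will differ.
sample = (
    '<ul>'
    '<li><a href="amazonas.html">Amazonas</a></li>'
    '<li><a href="/res/anzoategui.html">Anzoategui</a></li>'
    '</ul>'
)
links = extract_links(sample, "http://example.org/results/index.html")
for name, url in links:
    print(name, url)
```

The same function can be called again on each page it leads to (state, then municipality, then county, and so on), passing that page's own URL as base_url. If the links only appear after JavaScript runs, the HTML would have to come from the Selenium driver instead of a plain download.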
Packages to download:
xlrd 0.9.4
xlutils 1.7.1
xlwt 1.0.0
beautifulsoup4 4.4.1 (the package name for BeautifulSoup on the Python Package Index)
selenium 2.48.0
Most of the above can be downloaded from the Python Package Index (PyPI); re and time are part of the standard library and need no download.
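If installing the Excel packages is a hurdle, the standard library's csv module is enough to store one row per party per voting table. Below is a minimal sketch, assuming the column names in FIELDS (chosen here to mirror the State/Municipality/County/Center/Table breakdown in the question); the rows are made-up placeholders, not real results.

```python
import csv
import io

# Column names are an assumption that mirrors the hierarchy in the question.
FIELDS = ["state", "municipality", "county", "center", "table", "party", "votes"]


def write_results(rows, fileobj):
    """Write a header line followed by one CSV line per result row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up placeholder rows for illustration only.
rows = [
    {"state": "Amazonas", "municipality": "Example municipality",
     "county": "Example county", "center": "Example center",
     "table": 1, "party": "Party A", "votes": 30},
    {"state": "Amazonas", "municipality": "Example municipality",
     "county": "Example county", "center": "Example center",
     "table": 1, "party": "Party B", "votes": 12},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass open("results.csv", "w", newline="", encoding="utf-8") instead of the buffer; the file name is only an example.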
Upvotes: 1