Reputation: 29
I am new to python and pandas and have created a test web page with html code to use to help with learning how to pull the data and then format into CSV for use in excel. Below is the code I have come up with that puts it into a nice format but I am stuck on how to format it into a CSV file to import.
Code:
# Importing pandas
import pandas as pd
# The webpage URL whose table we want to extract
url = "/home/dvm01/e007"
# Assign the table data to a Pandas dataframe
table = pd.read_html(url,**index_col=0**)[0]
#table2 = pd.read_html(url)[0],pd.read_html(url)[1],pd.read_html(url)[6]
# Print the dataframe
print(table)
#print(table2)
# Store the dataframe in Excel file
#table.to_excel("data.xlsx")
Output:
Account Account.1
ID: e007
Description: ABST: 198, SUR: J DOUTHIT
Geo ID: 014.0198.0000
What I am trying to figure out is how to remove the index for the rows and make the text before the first: to be a column header. In row 1 I have two: but everything after the first: should be the data for the column header.
I would like to take the above current output and and have ID, Description, and Geo ID as the column headers and the text that comes after the ':' to be the data for each of the headers.
I do not need 'Account' and 'Account.1' I believe these are being recognized as column headers. Below is what I would like the output to look like in Excel, but I cannot figure out how to format it correctly to export out to a CSV that can be imported. Maybe I do not even need to import or format into a CSV, the 'table.to_excel' function seems to not need that step.
+------+---------------------------+---------------+
| ID | Description | Geo ID |
+------+---------------------------+---------------+
| e007 | ABST: 198, SUR: J Douthit | 014.0198.0000 |
+------+---------------------------+---------------+
I was able to remove the index numbers, by using index_col=0 above where I define the dfs variable. Not sure that is the best way but it does do what I was trying to accomplish for that portion.
Since I am new to python I am having a hard time formatting my question into Google or StackOverflow to get the answers I am looking for. If someone could just point me in the right direction in what I am looking for, that would work but examples would be nice as well.
Thanks for any guidance
Upvotes: 2
Views: 1141
Reputation: 808
so to format your questions you can show us an example of what you want. try something like this:
|id|name|data1|data2|date3|-url-|
|--|----|-----|-----|-----|-----|
|1 |xyz |datax|datay|dataz|x:url|
|2 |xyz |datax|datay|dataz|x:url|
|3 |xyz |datax|datay|dataz|x:url|
...
Then you can ask questions about how to create the right Dataframe output that fits your desired design:)
you can also use this generator online: https://www.tablesgenerator.com/text_tables
+----+------+-------+-------+-------+------+
| Id | Name | Data1 | Data2 | Data3 | Url |
+----+------+-------+-------+-------+------+
| 1 | xyz | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+
| 2 | xyz | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+
| 3 | xyz | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+
Ok now that you have your Data Table design. next I would ask you to try using Jupyter Notebook. This will let you test your dataframes line by line. Each Test should be a new transfermation of the dataset.
How I see the workflow going to get to your needs: 1. See test and see what your current DF columns are:
print(df.columns)
2. use this command to edit your columns:
df.rename(columns={'old column 1':'ID',
'old column 2':'Description',
'old column 3':'Geo ID'},
inplace=True)
use this command change an index data
df.rename(index={0:'zero',1:'one'}, inplace=True)
use this command to change a row
df.loc['--insert_Column_here--', '--insert_row_here--'] = new_value
Upvotes: 2