Rey
Rey

Reputation: 29

Format pandas output to csv

I am new to python and pandas and have created a test web page with html code to use to help with learning how to pull the data and then format into CSV for use in excel. Below is the code I have come up with that puts it into a nice format but I am stuck on how to format it into a CSV file to import.

Code:

# Importing pandas 
import pandas as pd 

# The webpage URL whose table we want to extract 
url = "/home/dvm01/e007"

# Assign the table data to a Pandas dataframe 
table = pd.read_html(url,**index_col=0**)[0]
#table2 = pd.read_html(url)[0],pd.read_html(url)[1],pd.read_html(url)[6]

# Print the dataframe 
print(table)
#print(table2)

# Store the dataframe in Excel file 
#table.to_excel("data.xlsx") 

Output:

            Account                                          Account.1
ID:                                         e007
Description:  ABST: 198, SUR: J DOUTHIT
Geo ID:                            014.0198.0000

What I am trying to figure out is how to remove the index for the rows and make the text before the first: to be a column header. In row 1 I have two: but everything after the first: should be the data for the column header.

I would like to take the above current output and and have ID, Description, and Geo ID as the column headers and the text that comes after the ':' to be the data for each of the headers.

I do not need 'Account' and 'Account.1' I believe these are being recognized as column headers. Below is what I would like the output to look like in Excel, but I cannot figure out how to format it correctly to export out to a CSV that can be imported. Maybe I do not even need to import or format into a CSV, the 'table.to_excel' function seems to not need that step.

+------+---------------------------+---------------+
| ID   | Description               | Geo ID        |
+------+---------------------------+---------------+
| e007 | ABST: 198, SUR: J Douthit | 014.0198.0000 |
+------+---------------------------+---------------+

I was able to remove the index numbers, by using index_col=0 above where I define the dfs variable. Not sure that is the best way but it does do what I was trying to accomplish for that portion.

Since I am new to python I am having a hard time formatting my question into Google or StackOverflow to get the answers I am looking for. If someone could just point me in the right direction in what I am looking for, that would work but examples would be nice as well.

Thanks for any guidance

Upvotes: 2

Views: 1141

Answers (1)

BlackFox
BlackFox

Reputation: 808

so to format your questions you can show us an example of what you want. try something like this:

|id|name|data1|data2|date3|-url-|
|--|----|-----|-----|-----|-----|
|1 |xyz |datax|datay|dataz|x:url|
|2 |xyz |datax|datay|dataz|x:url|
|3 |xyz |datax|datay|dataz|x:url|
...

Then you can ask questions about how to create the right Dataframe output that fits your desired design:)

you can also use this generator online: https://www.tablesgenerator.com/text_tables

+----+------+-------+-------+-------+------+
| Id | Name | Data1 | Data2 | Data3 | Url  |
+----+------+-------+-------+-------+------+
| 1  | xyz  | datax | datay | dataz | xurl | 
+----+------+-------+-------+-------+------+
| 2  | xyz  | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+
| 3  | xyz  | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+

Ok now that you have your Data Table design. next I would ask you to try using Jupyter Notebook. This will let you test your dataframes line by line. Each Test should be a new transfermation of the dataset.

How I see the workflow going to get to your needs: 1. See test and see what your current DF columns are:

print(df.columns)

2. use this command to edit your columns:

df.rename(columns={'old column 1':'ID',
                          'old column 2':'Description',
                          'old column 3':'Geo ID'}, 
                 inplace=True)
  1. use this command change an index data

    df.rename(index={0:'zero',1:'one'}, inplace=True)

  2. use this command to change a row

    df.loc['--insert_Column_here--', '--insert_row_here--'] = new_value
    

Upvotes: 2

Related Questions