user16993210
user16993210

Reputation:

Edit excel file with python pandas

I am working on a couple of excel sheets which all look like the first one attached. Currently I'm trying to remove the first 11 rows, and also the first column based on the index. Somehow I can not get it to work, The title "Eidgenössische Volkszählung 2000. Bevölkerungsentwicklung der Gemeinden 1850-2000" is still visible. It removes all the rows without the Title.

My end goal would be to make it look like the second picture, shouldnt be that hard i think.

If someone could tell me how to remove rows and colums based on the index i would be grateful. I did not find any solution within my 1h google search.

import pandas as pd

df = pd.read_excel("AG.xls", "AG")
df = df.drop([0,1,2,3,4,5,6,7,8,9,10])
print(df)

enter image description here

enter image description here

Upvotes: 1

Views: 441

Answers (3)

arce.est
arce.est

Reputation: 379

Try:

import pandas as pd
import numpy as np

df = pd.read_excel(
    "AG.xls", 
    sheet_name="AG", 
    skiprows=10, 
    skipfooter=2, 
    usecols='A:W')

df.columns = ['id','city','1850','1860','1870','1880','1888','1900','1910','1920',
'1930','1941','1950','1960','1970','1980','1990','2000',
'1850 / 1880','1880 / 1910','1910 / 1941','1941 / 1970','1970 / 2000']

df = df[df['id'].notnull()]

# replace column by value identify by "Total Acum"
df['city'] = df['city'].fillna('Total Acum')

# clean and replace values "..." to NaN values
df = df.replace('...', np.nan)

# or remove from dataframe
# df = df[df['city'].notnull()]

df.head(5)

I recommend that you manage the data that are separated by header with the accumulated values, so that later you can decide what to do with them.

The result is as follows

enter image description here

Upvotes: 0

garagnoth
garagnoth

Reputation: 234

Try this one

Option 1: dropping the rows and column while reading

df = pd.read_excel('AG.xlsx', skiprows=list(range(10))).drop('Unnamed: 0', axis=1)

To skip rows, you can use skiprows argument for read_excel, further reading. Since you only need to drop the first 10 rows, we use list(range(10)) to construct a list of 0 to 9.

And then, since we're going to drop the first column and since there's no label for the column, then we can use drop('Unnamed: 0', axis=1) to do that. Pandas will label the first empty column label as Unnamed: 0.

Option 2: dropping rows and column after reading If you want to drop the rows and column after making the dataframe, then you can do this.

# drop rows
new_df = df.drop(labels=list(range(10)))

# drop columns
new_df = new_df.drop(new_df.columns[0], axis=1)

Upvotes: 0

Corralien
Corralien

Reputation: 120559

Try:

import pathlib

data_dir = pathlib.Path('.')

data = []
for xlsfile in data_dir.glob('??.xls'):
    df = pd.read_excel(xlsfile, skiprows=8)
    df = df[pd.to_numeric(df.iloc[:, 0], errors='coerce').notna()]
    df = pd.concat([df.iloc[:, 1:].set_index('Unnamed: 1')], 
                   keys=[xlsfile.stem], names=['Kanton', 'Stadt'])
    data.append(df)

df = pd.concat(data)

Rows are filtered with the first column. If the column Unnamed: 0 can be converted to numeric then keep it else drop. I use skiprows=8 to get the column headers (1850, 1860, ...). The index of each dataframe is a MultiIndex: the filename name - Kanton - (xlsfile.stem) and the city - Stadt - (column Unnamed: 1).

Output result:

>>> df
                          1850    1860    1870    1880    1888    1900    1910    1920     1930     1941     1950     1960     1970     1980     1990     2000 1850 / 1880 1880 / 1910 1910 / 1941 1941 / 1970 1970 / 2000
Kanton Stadt                                                                                                                                                                                                              
TI     Arbedo-Castione     801     773     800     839     873    1042    1200    1212     1260     1302     1335     1467     2456     3058     3570     3729         0.2         1.2         0.3         2.2         1.4
       Bellinzona 1)      3209    3462    3950    4038    5553    8255   10406   10232    10706    10948    12060    13435    16979    16743    16849    16463         0.8         3.2         0.2         1.5        -0.1
       Cadenazzo           216     229     231     538     316     333     408     481      482      505      621      693      995     1179     1500     1755         3.1        -0.9         0.7         2.4         1.9
       Camorino            321     337     467     483     423     405     491     571      548      552      702      920     1376     1476     1709     2210         1.4         0.1         0.4         3.2         1.6
       Giubiasco 2)       1417    1503    1539    1587    1570    1722    2395    2585     2607     2932     3311     4281     5796     6585     6982     7418         0.4         1.4         0.7         2.4         0.8
...                        ...     ...     ...     ...     ...     ...     ...     ...      ...      ...      ...      ...      ...      ...      ...      ...         ...         ...         ...         ...         ...
ZG     Risch            1005.0  1027.0   896.0  1235.0  1171.0  1047.0  1106.0  1258.0   1281.0   1449.0   1630.0   2038.0   3182.0   3988.0   5414.0   7241.0         0.7        -0.4         0.9         2.7         2.8
       Steinhausen       490.0   465.0   436.0   468.0   498.0   443.0   470.0   560.0    743.0    787.0   1078.0   1621.0   4138.0   6082.0   7207.0   8801.0        -0.2         0.0         1.7         5.9         2.5
       Unterägeri       2243.0  2423.0  2565.0  2426.0  2378.0  2593.0  2502.0  2787.0   3005.0   2969.0   3340.0   3832.0   4671.0   5371.0   6151.0   7179.0         0.3         0.1         0.6         1.6         1.4
       Walchwil         1039.0  1030.0  1072.0  1054.0  1021.0  1059.0  1043.0  1046.0   1072.0   1118.0   1226.0   1400.0   1675.0   2181.0   2654.0   3150.0         0.0         0.0         0.2         1.4         2.1
       Zug  1)          3302.0  3854.0  4243.0  4805.0  5120.0  6508.0  8096.0  9499.0  11113.0  12372.0  14488.0  19792.0  22972.0  21609.0  21705.0  22973.0         1.3         1.8         1.4         2.2         0.0

[2896 rows x 21 columns]

Upvotes: 0

Related Questions