prashanth manohar
prashanth manohar

Reputation: 680

Extracting several Tables from same spreadsheet and merging them to get a single continuous Data frame

I have a Spreadsheet of Time series Data

It has separate tables for each year with a single row gap in between. I want to have the Year from table header as part of the date column. So that I can plot charts and do simple comparisons of the data (YoY) etc

import pandas as pd

RigCountWorld_df = pd.read_excel(open('Worldwide Rig Count Nov 2022.xlsx', 'rb'),
              sheet_name='Worldwide_Rigcount',index_col=None, header=6)

RigCountWorld_df

The code I have is no where near helpful. Even the names of pandas operations I need to use will be helpful for me.

I need a continuous table with data from all years. It would make sense to have latest data at the very end.

Even transposing the tables separately and adding them as new columns would make sense (with the column headers containing Year-Month names.

Upvotes: 0

Views: 72

Answers (2)

Andreas
Andreas

Reputation: 159

Another solution validating year and month content (it is assumed, that the column names are in the first row of RigCountWorld_df):

df = RigCountWorld_df.copy()
first_col = 2  # First column with data
column_names = df.iloc[0, first_col:].to_list()
df["Year"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).str.match(r"^20\d\d$"), None).ffill()
df["Month"] = df.iloc[:,[1]].where(df.iloc[:,1].astype(str).isin(("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")), None)
df = df[df['Month'].notna()]
df = df.iloc[:, first_col:].set_index(["Year", "Month"])
df.columns = column_names
df

Upvotes: 0

Timeless
Timeless

Reputation: 37787

Here is a proposition with some of pandas built-in functions.

import pandas as pd

df = pd.read_excel("Worldwide Rig Count Nov 2022.xlsx",
                   sheet_name="Worldwide_Rigcount", header=None, usecols="B:K", skiprows=6)
    ​
df.dropna(how="all", inplace=True)
df.insert(0, "Year", np.where(df[10].eq("Total World"), df[1], None))
df["Year"].ffill(inplace=True)
df.drop_duplicates(subset= df.columns[2:], inplace=True)
df.columns = ["Year", "Month"] + df.loc[0, 2:].tolist()
df = df.loc[1:, :].reset_index(drop=True)

# Output :

print(df.sample(5).to_string())
​
     Year Month Latin America Europe Africa Middle East Asia Pacific Total Intl. Canada  U.S. Total World
613  1975   Mar           310    113    122         173          208         926    192  1651        2769
588  1977   Apr           324    135    165         185          167         976    129  1907        3012
596  1977   Dec           353    142    172         195          182        1044    259  2141        3444
221  2005   Jan           307     57     50         242          204         860    550  1255        2665
566  1979   Aug           440    149    199         144          219        1151    376  2222        3749

# Check :

48 years with 13 rows (12 months + Average) for each year.

print(df.groupby("Year").size().value_counts())

13    48
dtype: int64

Upvotes: 1

Related Questions