Reputation: 13
Here's my excel data:
Main Topic | ||||||||
---|---|---|---|---|---|---|---|---|
Sr No | column 1 | column 2 | ||||||
Sr No | sub-col1 | sub-col2 | sub-col3 | sub-col4 | sub-col1 | sub-col2 | sub-col3 | sub-col4 |
First Topic | ||||||||
1) Sub Topic-1 | ||||||||
1 | 107 | 207 | 307 | 407 | 507 | 607 | 707 | 807 |
2 | 108 | 208 | 308 | 408 | 508 | 608 | 708 | 808 |
3 | 109 | 209 | 309 | 409 | 509 | 609 | 709 | 809 |
4 | 110 | 210 | 310 | 410 | 510 | 610 | 710 | 810 |
2) Sub Topic-2 | ||||||||
1 | 113 | 213 | 313 | 413 | 513 | 613 | 713 | 813 |
2 | 114 | 214 | 314 | 414 | 514 | 614 | 714 | 814 |
3 | 115 | 215 | 315 | 415 | 515 | 615 | 715 | 815 |
4 | 116 | 216 | 316 | 416 | 516 | 616 | 716 | 816 |
Second Topic | ||||||||
1) Sub Topic-1 | ||||||||
1 | 120 | 220 | 320 | 420 | 520 | 620 | 720 | 820 |
2 | 121 | 221 | 321 | 421 | 521 | 621 | 721 | 821 |
3 | 122 | 222 | 322 | 422 | 522 | 622 | 722 | 822 |
4 | 123 | 223 | 323 | 423 | 523 | 623 | 723 | 823 |
2) Sub Topic-2 | ||||||||
1 | 126 | 226 | 326 | 426 | 526 | 626 | 726 | 826 |
2 | 127 | 227 | 327 | 427 | 527 | 627 | 727 | 827 |
3 | 128 | 228 | 328 | 428 | 528 | 628 | 728 | 828 |
4 | 129 | 229 | 329 | 429 | 529 | 629 | 729 | 829 |
image1.png
Now, I want to change the data (because my data modeling guy said to make it this way before adding to sql) and I don't know how to change it to such dataframe.
The change is shown in below image:
image2.png
P.S. Note: This is dummy data and the length of rows and columns always vary. Any help me appreciated.
Upvotes: 0
Views: 763
Reputation: 31166
This is just about being systematic
src = """| Main Topic | | | | | | | | |
|----------------|----------|----------|----------|----------|----------|----------|----------|----------|
| Sr No | column 1 | | | | column 2 | | | |
| Sr No | sub-col1 | sub-col2 | sub-col3 | sub-col4 | sub-col1 | sub-col2 | sub-col3 | sub-col4 |
| First Topic | | | | | | | | |
| 1) Sub Topic-1 | | | | | | | | |
| 1 | 107 | 207 | 307 | 407 | 507 | 607 | 707 | 807 |
| 2 | 108 | 208 | 308 | 408 | 508 | 608 | 708 | 808 |
| 3 | 109 | 209 | 309 | 409 | 509 | 609 | 709 | 809 |
| 4 | 110 | 210 | 310 | 410 | 510 | 610 | 710 | 810 |
| | | | | | | | | |
| 2) Sub Topic-2 | | | | | | | | |
| 1 | 113 | 213 | 313 | 413 | 513 | 613 | 713 | 813 |
| 2 | 114 | 214 | 314 | 414 | 514 | 614 | 714 | 814 |
| 3 | 115 | 215 | 315 | 415 | 515 | 615 | 715 | 815 |
| 4 | 116 | 216 | 316 | 416 | 516 | 616 | 716 | 816 |
| | | | | | | | | |
| Second Topic | | | | | | | | |
| 1) Sub Topic-1 | | | | | | | | |
| 1 | 120 | 220 | 320 | 420 | 520 | 620 | 720 | 820 |
| 2 | 121 | 221 | 321 | 421 | 521 | 621 | 721 | 821 |
| 3 | 122 | 222 | 322 | 422 | 522 | 622 | 722 | 822 |
| 4 | 123 | 223 | 323 | 423 | 523 | 623 | 723 | 823 |
| | | | | | | | | |
| 2) Sub Topic-2 | | | | | | | | |
| 1 | 126 | 226 | 326 | 426 | 526 | 626 | 726 | 826 |
| 2 | 127 | 227 | 327 | 427 | 527 | 627 | 727 | 827 |
| 3 | 128 | 228 | 328 | 428 | 528 | 628 | 728 | 828 |
| 4 | 129 | 229 | 329 | 429 | 529 | 629 | 729 | 829 |"""
# uniform grid of data...
df = (pd.read_csv(io.StringIO(src), sep="|", header=None).drop(columns=[0,10])
.applymap(str.strip).replace("",np.nan).dropna(how="all").drop(1)
.reset_index(drop=True)
.fillna(method="ffill", axis=1)
)
# first 3 rows are column names.. of which first is the main topic
df2 = df.iloc[3:].copy()
# extract topic and sub-topic from relevent rows, put onto other rows with fillna()
df2 = (df2.assign(st=df2[1].str.contains(r"^\d+\).*\-\d+$"),
t=lambda dfa: ~dfa.st & dfa[1].str.contains("^[A-Z,a-z, ]+"),
tk=lambda dfa: np.where(dfa.t, dfa[1], np.nan),
stk=lambda dfa: np.where(dfa.st, dfa[1], np.nan))
.fillna(method="ffill")
)
# remove rows that contained topic & sub-topic, plus working columns
# topic & sub-topic are part of the index
df2 = df2.drop(df2.loc[df2.t | df2.st].index).drop(columns=["t","st"]).set_index(["tk","stk"])
# back to first 3 rows are column names, 1st being main topic
df2.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values, names=["topic","col","sub-col"])
# move SR No into index, then stack first level of column multi-index into row index
# cleanup order of items in the index
df2 = df2.set_index(df2.columns[0], append=True).stack(0).reorder_levels([3,0,1,2])
# cleanup index name
df2.reindex(df2.index.set_names("SR No",level=3))
col column 1 column 2
sub-col sub-col1 sub-col2 sub-col3 sub-col4 sub-col1 sub-col2 sub-col3 sub-col4
topic tk stk SR No
Main Topic First Topic 1) Sub Topic-1 1 107 207 307 407 507 607 707 807
2 108 208 308 408 508 608 708 808
3 109 209 309 409 509 609 709 809
4 110 210 310 410 510 610 710 810
2) Sub Topic-2 1 113 213 313 413 513 613 713 813
2 114 214 314 414 514 614 714 814
3 115 215 315 415 515 615 715 815
4 116 216 316 416 516 616 716 816
Second Topic 1) Sub Topic-1 1 120 220 320 420 520 620 720 820
2 121 221 321 421 521 621 721 821
3 122 222 322 422 522 622 722 822
4 123 223 323 423 523 623 723 823
2) Sub Topic-2 1 126 226 326 426 526 626 726 826
2 127 227 327 427 527 627 727 827
3 128 228 328 428 528 628 728 828
4 129 229 329 429 529 629 729 829
Upvotes: 1