Reputation: 47
I am trying to filter Data, for Excel Files in Pandas. Based on the Column Value i.e. String Value.
I Have tried the following to achieve what I want :-
Latest Code shown Below as of 07/08/2019
In [13]:
import pandas as pd
xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')
data = pd.read_excel(xls, sheet_name="Sheet1")
pd.options.display.max_rows = 1000
df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])
df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]
Out[13]:
Venue A/C DISPLAY/ Date BID
475 SHAWBURY DAK DISPLAY 2008-07-24 00:00:00 188
476 SHAWBURY SPIT DISPLAY 2008-07-24 00:00:00 188
477 COTTESMORE SPIT DISPLAY NaN 757
478 COTTESMORE DAK DISPLAY NaN 757
484 SUNDERLAND SPIT DISPLAY NaN 333
487 EAST FORTUNE SPIT DISPLAY NaN 406
489 WINDERMERE HS DISPLAY 2008-07-25 00:00:00 138
490 WINDERMERE DAK DISPLAY 2008-07-25 00:00:00 138
504 WIGTON DHS DISPLAY 2008-07-26 00:00:00 144
506 WINDERMERE HS DISPLAY 2008-07-26 00:00:00 138
507 WINDERMERE DAK DISPLAY 2008-07-26 00:00:00 138
508 SUNDERLAND HS DISPLAY NaN 333
509 SUNDERLAND DAK DISPLAY NaN 333
I want for example the SUNDERLAND Lines, to be changed to one line i.e. :-
SUNDERLAND DHS DISPLAY NaN 00:00:00
And for WINDERMERE two Lines i.e. :-
WINDERMERE DHS DISPLAY 2008-07-25 00:00:00
WINDERMERE DHS DISPLAY 2008-07-26 00:00:00
So what I am basically saying, from the Excel Table Data, displayed in Jupyter Notebook, how do I change all off :-
VENUE DAK DISPLAY 2008-09-10 00:00:00
VENUE HS DISPLAY 2008-09-10 00:00:00
(The Dates in the rows above and below, will be whatever is shown in that row, not the specific date given above and the ones below, except the WINDERMERE ones, shown above which is correct)
Or
VENUE HS DISPLAY 2008-09-06 00:00:00
VENUE DAK DISPLAY 2008-09-06 00:00:00
Or
VENUE DAK DISPLAY 2008-05-25 00:00:00
VENUE SPIT DISPLAY 2008-05-25 00:00:00
And the other way around, like with DAK and HS
VENUE same both lines i.e. SUNDERLAND i.e. :-
SUNDERLAND
SUNDERLAND
To :-
VENUE DHS DISPLAY 2008-09-10 00:00:00
like it says for the VENUE WIGTON
And the DAK and SPIT Ones to :-
VENUE DS DISPLAY 2008-09-10 00:00:00
i.e. SHAWBURY
Also if possible, to change the Year to 2009, in the date because 2008 is wrong.
And get rid of the other Data.
I have noticed that in the BID Column, which I now include. In the Data I wan't to keep, the Bid Numbers are the same in both rows, i.e. the same VENUE . So what should I type, to remove rows, where there is only 1 of a Bid Number i.e. the VENUE EAST FORTUNE.
But keep the Rows, where the Bid Numbers are the same, i.e. both lines directly underneath each other as shown in the picture ? Except where in the A/C Column the values are DHS or DSS ?
Hi Benoit, here is the output I get, which is different to yours for some reason.
[[475, 'SHAWBURY', 'DAK', 'DISPLAY', '2008-07-24 00:00:00', 188],
[476, 'SHAWBURY', 'SPIT', 'DISPLAY', '2008-07-24 00:00:00', 188],
[477, 'COTTESMORE', 'SPIT', 'DISPLAY', None, 757],
[478, 'COTTESMORE', 'DAK', 'DISPLAY', None, 757],
[484, 'SUNDERLAND', 'SPIT', 'DISPLAY', None, 333],
[487, 'EAST FORTUNE', 'SPIT', 'DISPLAY', None, 406],
[489, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-25 00:00:00', 138],
[490, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-25 00:00:00', 138],
[504, 'WIGTON', 'DHS', 'DISPLAY', '2008-07-26 00:00:00', 144],
[506, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-26 00:00:00', 138],
[507, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-26 00:00:00', 138],
[508, 'SUNDERLAND', 'HS', 'DISPLAY', None, 333],
[509, 'SUNDERLAND', 'DAK', 'DISPLAY', None, 333]]
Latest Data :-
Venue A/C DISPLAY/ Date BID
25 SHUTTLEWORTH DAK DISPLAY NaN 529
55 KEMBLE DAK DISPLAY NaN 461
69 NORTHWICH SPIT DISPLAY 2008-05-10 00:00:00 514
72 POCKLINGTON SPIT DISPLAY 2009-05-10 00:00:00 821
75 BERLIN DAK DISPLAY 2008-05-12 00:00:00 587
78 MILDENHALL SPIT DISPLAY 2009-05-15 00:00:00 920
93 DUXFORD HS DISPLAY NaN 611
103 CRANWELL HS DISPLAY 2008-05-20 00:00:00 44
145 SCARBOROUGH DAK DISPLAY 2008-05-25 00:00:00 610
150 SCARBOROUGH SPIT DISPLAY 2008-05-25 00:00:00 610
151 CORBRIDGE SPIT DISPLAY NaN 353
167 BRIDGEND-CNX SPIT DISPLAY 2008-05-31 00:00:00 527
173 TARRANT RUSHDEN HS DISPLAY NaN 132
174 TARRANT RUSHDEN DAK DISPLAY NaN 132
179 NORTHOLT SPIT DISPLAY 2009-06-05 00:00:00 870
214 BRIZE NORTON HS DISPLAY NaN 939
218 ROPLEY HS DISPLAY 2008-06-13 00:00:00 355
223 THWAITES HS DISPLAY NaN 364
231 ROPLEY HS DISPLAY NaN 355
240 COSFORD HS DISPLAY 2008-06-14 00:00:00 667
241 QUORN HS DISPLAY NaN 314
244 COSFORD DAK DISPLAY 2008-06-14 00:00:00 NaN
260 REDHILL SPIT DISPLAY NaN 686
269 KEMBLE DAK DISPLAY NaN 316
270 KEMBLE HS DISPLAY NaN 316
280 KEMBLE SPIT DISPLAY 2008-06-21 00:00:00 316
285 KEMBLE DAK DISPLAY 2008-06-21 00:00:00 316
317 BRUNTINGTHORPE SPIT DISPLAY NaN 106
329 ELDWICK SPIT DISPLAY NaN 430
333 PLYMOUTH HOE DAK DISPLAY 2008-06-28 00:00:00 528
340 OLD BUCKENHAM HS DISPLAY NaN 424
344 COSFORD SPIT DISPLAY 2008-06-28 00:00:00 68
350 TAIN SPIT DISPLAY NaN 433
355 WITTERING SPIT DISPLAY 2008-07-03 00:00:00 376
362 WADDINGTON HS DISPLAY 2008-07-04 00:00:00 666
372 ANNAN DAK DISPLAY NaN 606
373 ANNAN SPIT DISPLAY NaN 606
375 WADDINGTON HS DISPLAY 2008-07-05 00:00:00 666
389 SHAWBURY SPIT DISPLAY 2008-07-05 00:00:00 183
393 CRANWELL SPIT DISPLAY 2008-07-07 00:00:00 823
399 SWANSEA HS DISPLAY NaN 335
403 DUXFORD LEGENDS HS DISPLAY 2008-07-11 00:00:00 37
405 YEOVILTON HS DISPLAY NaN 549
407 WOODHOUSE HS DISPLAY 2008-07-11 00:00:00 545
429 CAPEL-LE-FERNE HS DISPLAY 2008-07-12 00:00:00 298
430 ARDINGLY(SUSSEX) HS DISPLAY NaN 189
439 CAERNARFON SPIT DISPLAY 2008-07-18 00:00:00 481
445 PENARTH DAK DISPLAY NaN 303
448 FAIRFORD(RIAT) SPIT DISPLAY 2008-07-19 00:00:00 625
455 CHOLMONDELEY CASTLE SPIT DISPLAY NaN 494
459 HAMPSTHWAITE DAK DISPLAY NaN 828
465 ODIHAM(CAS COMM) SPIT DISPLAY 2009-07-21 00:00:00 NaN
469 ODIHAM FAMS DAY SPIT DISPLAY 2008-07-22 00:00:00 6
470 HENLOW FAM DAY SPIT DISPLAY NaN 146
475 SHAWBURY DAK DISPLAY 2008-07-24 00:00:00 188
476 SHAWBURY SPIT DISPLAY 2008-07-24 00:00:00 188
477 COTTESMORE SPIT DISPLAY NaN 757
478 COTTESMORE DAK DISPLAY NaN 757
484 SUNDERLAND SPIT DISPLAY NaN 333
487 EAST FORTUNE SPIT DISPLAY NaN 406
489 WINDERMERE HS DISPLAY 2008-07-25 00:00:00 138
490 WINDERMERE DAK DISPLAY 2008-07-25 00:00:00 138
504 WIGTON DHS DISPLAY 2008-07-26 00:00:00 144
506 WINDERMERE HS DISPLAY 2008-07-26 00:00:00 138
507 WINDERMERE DAK DISPLAY 2008-07-26 00:00:00 138
508 SUNDERLAND HS DISPLAY NaN 333
509 SUNDERLAND DAK DISPLAY NaN 333
511 AUDLEM SPIT DISPLAY 2008-07-26 00:00:00 706
524 LYNEHAM FAM DAY SPIT DISPLAY NaN 662
525 MALVERN SPIT DISPLAY NaN 26
527 DAMYNS HALL SPIT DISPLAY 2008-08-08 00:00:00 766
529 DUXFORD SPIT DISPLAY 2008-08-09 00:00:00 612
530 DAMYNS HALL SPIT DISPLAY NaN 766
534 BLACKPOOL HS DISPLAY NaN 698
540 EASTBOURNE HS DISPLAY 2008-08-13 00:00:00 407
543 EASTBOURNE HS DISPLAY 2008-08-14 00:00:00 407
546 EASTBOURNE HS DISPLAY 2008-08-15 00:00:00 407
548 ROUGHAM SPIT DISPLAY 2009-08-15 00:00:00 551
551 DUXFORD DAK DISPLAY 2008-08-15 00:00:00 613
552 ROUGHAM DAK DISPLAY NaN 551
556 EASTBOURNE HS DISPLAY 2008-08-16 00:00:00 407
564 CROMER SPIT DISPLAY 2008-08-19 00:00:00 139
569 CRANWELL SPIT DISPLAY 2009-08-20 00:00:00 913
578 SHOREHAM SPIT DISPLAY NaN 366
581 RAMSEY SPIT DISPLAY 2009-08-22 00:00:00 896
583 WHITBY DAK DISPLAY 2008-08-22 00:00:00 125
584 WHITBY HS DISPLAY 2008-08-22 00:00:00 125
586 WHITBY HS DISPLAY 2008-08-23 00:00:00 125
587 WHITBY DAK DISPLAY 2008-08-23 00:00:00 125
591 SHOREHAM HS DISPLAY 2008-08-23 00:00:00 366
605 TWINWOOD AIRFIELD DAK DISPLAY 2008-08-30 00:00:00 934
608 ROMSEY DAK DISPLAY NaN 175
613 STONELEIGH PARK HS DISPLAY NaN NaN
614 DUNSFOLD HS DISPLAY NaN 583
622 ROMSEY DAK DISPLAY 2008-08-31 00:00:00 175
623 ROMSEY HS DISPLAY NaN 175
625 STONELEIGH PARK HS DISPLAY NaN 563
629 TWINWOOD AIRFIELD DAK DISPLAY NaN 934
633 PLYMOUTH HOE SPIT DISPLAY 2008-09-03 00:00:00 295
646 PORTRUSH DAK DISPLAY 2008-09-05 00:00:00 76
647 PORTRUSH HS DISPLAY 2008-09-05 00:00:00 76
652 DUXFORD HS DISPLAY NaN 615
659 PORTRUSH HS DISPLAY 2008-09-06 00:00:00 76
660 PORTRUSH DAK DISPLAY 2008-09-06 00:00:00 76
667 GREAT YARMOUTH HS DISPLAY NaN 655
678 ODIHAM SPIT DISPLAY 2008-09-09 00:00:00 137
686 GUERNSEY DAK DISPLAY 2008-09-10 00:00:00 582
687 GUERNSEY HS DISPLAY 2008-09-10 00:00:00 582
688 JERSEY DAK DISPLAY 2008-09-10 00:00:00 581
689 JERSEY HS DISPLAY 2008-09-10 00:00:00 581
711 WOODSTOCK(BLENHEIM) HS DISPLAY 2008-09-12 00:00:00 362
717 MORECAMBE SPIT DISPLAY NaN 368
720 WOODSTOCK DAK DISPLAY 2008-09-13 00:00:00 362
721 WOODSTOCK HS DISPLAY NaN 362
744 NORTHOLT HS DISPLAY 2008-09-17 00:00:00 532
746 GOODWOOD HS DISPLAY 2008-09-18 00:00:00 256
755 GOODWOOD HS DISPLAY 2008-09-19 00:00:00 256
756 NORTHOLT HS DISPLAY 2009-09-19 00:00:00 863
763 KEMBLE HS DISPLAY NaN 330
766 KEMBLE HS DISPLAY 2008-09-20 00:00:00 330
774 SANICOLE DAK DISPLAY NaN 69
776 GOODWOOD HS DISPLAY 2008-09-20 00:00:00 256
777 SANICOLE HS DISPLAY NaN 69
790 SOUTHPORT HS DISPLAY NaN 584
791 SOUTHPORT DAK DISPLAY NaN 584
803 DUXFORD SPIT DISPLAY 2008-10-11 00:00:00 616
Any help would be appreciated
Regards
Eddie Winch
Upvotes: 0
Views: 6720
Reputation: 969
[Updated] - This is kin of weird but it respects the rules you want to apply
(which are a little weird as well, so it makes sense)
In [1]:
import pandas as pd
data = [
[475, 'SHAWBURY', 'DAK', 'DISPLAY', '2008-07-24 00:00:00', 188],
[476, 'SHAWBURY', 'SPIT', 'DISPLAY', '2008-07-24 00:00:00', 188],
[477, 'COTTESMORE', 'SPIT', 'DISPLAY', None, 757],
[478, 'COTTESMORE', 'DAK', 'DISPLAY', None, 757],
[484, 'SUNDERLAND', 'SPIT', 'DISPLAY', None, 333],
[487, 'EAST FORTUNE', 'SPIT', 'DISPLAY', None, 406],
[489, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-25 00:00:00', 138],
[490, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-25 00:00:00', 138],
[504, 'WIGTON', 'DHS', 'DISPLAY', '2008-07-26 00:00:00', 144],
[506, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-26 00:00:00', 138],
[507, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-26 00:00:00', 138],
[508, 'SUNDERLAND', 'HS', 'DISPLAY', None, 333],
[509, 'SUNDERLAND', 'DAK', 'DISPLAY', None, 333]
]
df = pd.DataFrame(data, columns=['Index', 'Venue', 'A/C', 'DISPLAY', 'Date', 'BID']).set_index('Index')
df
Out [1]:
Venue A/C DISPLAY Date BID
Index
475 SHAWBURY DAK DISPLAY 2008-07-24 00:00:00 188
476 SHAWBURY SPIT DISPLAY 2008-07-24 00:00:00 188
477 COTTESMORE SPIT DISPLAY None 757
478 COTTESMORE DAK DISPLAY None 757
484 SUNDERLAND SPIT DISPLAY None 333
487 EAST FORTUNE SPIT DISPLAY None 406
489 WINDERMERE HS DISPLAY 2008-07-25 00:00:00 138
490 WINDERMERE DAK DISPLAY 2008-07-25 00:00:00 138
504 WIGTON DHS DISPLAY 2008-07-26 00:00:00 144
506 WINDERMERE HS DISPLAY 2008-07-26 00:00:00 138
507 WINDERMERE DAK DISPLAY 2008-07-26 00:00:00 138
508 SUNDERLAND HS DISPLAY None 333
509 SUNDERLAND DAK DISPLAY None 333
In [2] :
## Keep BID where we have at least 2 rows
test = df.groupby(by=['BID', 'Venue', 'DISPLAY']).count()
test = test[test['A/C']>1]
bids = test.reset_index().BID.tolist()
# Here if there is already `DHS` and `DS` in the column `A/C`, I want to keep them
df.loc[df['A/C']=='DHS', 'Aircraft'] = 'DHS'
df.loc[df['A/C']=='DS', 'Aircraft'] = 'DS'
# I keep 1 row for each bid that has at least 2 rows, and their Aircraft's value are updated
for bid in bids:
df.loc[(df['BID']==bid) & (df['A/C']=='DAK'), 'Aircraft']= 'DHS'
df.loc[(df['BID']==bid) & (df['A/C']=='SPIT'), 'Aircraft'] = 'DS'
df = df[df['Aircraft'].notnull()].drop(columns=['A/C'], axis=1)
data
Out [2]:
Venue DISPLAY Date BID Aircraft
Index
475 SHAWBURY DISPLAY 2008-07-24 00:00:00 188 DHS
476 SHAWBURY DISPLAY 2008-07-24 00:00:00 188 DS
477 COTTESMORE DISPLAY None 757 DS
478 COTTESMORE DISPLAY None 757 DHS
484 SUNDERLAND DISPLAY None 333 DS
490 WINDERMERE DISPLAY 2008-07-25 00:00:00 138 DHS
504 WIGTON DISPLAY 2008-07-26 00:00:00 144 DHS
507 WINDERMERE DISPLAY 2008-07-26 00:00:00 138 DHS
509 SUNDERLAND DISPLAY None 333 DHS
Upvotes: 1