Reputation: 11
I have a PDF document reporting COVID-19 numbers for counties in a state in a table in the document. I am reading the table into a pandas dataframe using camelot, and I am extracting the values in various rows based on the value, the county's name, in the first column. For that, I am using Boolean indexing as described here: How do I sum values in a column that match a given condition using pandas?
I am using the data extracted to report on the COVID-19 statistics in a subset of counties listed in the report that are of interest to my organization. I am also extracting the total numbers for the state, but the producers of the PDF cannot decide if they want to call that row of data "Gesamt" ("Total") or "Gesamtergebnis" ("Total result"). The dataframe I am working with, after camelot extracts the table from the PDF, looks like this:
0 1 2 3
...
9 A County 13.789 (+22) 1.566,0
10 My County 16.581 (+45) 3.040,0
11 Their County 7.445 (+15) 2.821,6
...
55 Gesamt 304.950 (+820) 2.747,2
The code below works, if they use "Gesamt." I would like to write it so that it will also work if they use "Gesamtergebnis." I cannot rely on the total ("Gesamt" or "Gesamtergebnis") always being in the same row.
# Open LGA reports for yesterday and the day before
# TO DO: Sometimes the LGA report is named COVID_Lagebericht_LGA_yymmdd.pdf or it ends in _01
# Add in a try/else statement to compensate for this
rptyes = f'Reports_LGA/{yday_yymmdd}_COVID_Tagesbericht_LGA.pdf'
rptdbf = f'Reports_LGA/{daybef_yymmdd}_COVID_Tagesbericht_LGA.pdf'
# Read the LGA reports into dataframes.
dfyes = camelot.read_pdf(rptyes, pages='2', flavor='stream')
dfdbf = camelot.read_pdf(rptdbf, pages='2', flavor='stream')
# Extract the statewide 7-D-I
# TO DO: Sometimes the last line says "Gesamt", sometimes "Gesamtergebnis" or something else.
# Add in some sort of error checking or try/else statement or regular expression to compensate
landindexyes = lambda land: dfyes[0].df.loc[dfyes[0].df[0] == land].index[0]
landindexdbf = lambda land: dfdbf[0].df.loc[dfdbf[0].df[0] == land].index[0]
land = 'Gesamt'
bwname = 'Baden-Württemberg'
bwcases = int(dfyes[0].df.loc[landindexyes(land), 1].replace('.',''))
bwcasesdiff = dfyes[0].df.loc[landindexyes(land), 2]
bwdeaths = int(dfyes[0].df.loc[landindexyes(land), 4].replace('.',''))
bwdeathsdiff = dfyes[0].df.loc[landindexyes(land), 5]
bw7diyes = float(dfyes[0].df.loc[landindexyes(land), 7].replace(',','.'))
bw7didbf = float(dfdbf[0].df.loc[landindexdbf(land), 7].replace(',','.'))
bw7didiff = bw7diyes - bw7didbf
rptrowsbw = [bwname, bwcases, bwcasesdiff, bwdeaths, bwdeathsdiff, bw7diyes, bw7didbf]
How can I use a regular expression to match either "Gesamt" or "Gesamtergebnis" in the variable passed to the lambda expressions 'landindexyes' and 'landindexdbf'?
If regular expressions are not the way to go, I am open to other suggestions. I am thinking that an if/else might work, but I don't think that would be as elegant.
Upvotes: 1
Views: 6960
Reputation: 126
Unfortunately, I can't see your data frame, so I can't write 100% correct lines. I would like to refer you to the first answer here: Filtering DataFrame by finding exact word (not combined) in a column of strings.
So, in your case something like:
df[df["column_name"].str.contains(r'(?:\s|^)Gesamt(?:\s|$)')]]==True
or
df[df["column_name"].str.contains(r'(?:\s|^)Gesamtergebnis(?:\s|$)')]]==True
If you are not sure that the spelling is correct in your dataset, you can try matching algorithms, such as Fuzzy Wuzzy: https://www.datacamp.com/community/tutorials/fuzzy-string-python.
Edit (from comments): RegEx slows code a lot, so what about an idea to change all "Gesamtergebnis" values into "Gesamt" in the column? So, you can use something like this in your TODO part:
df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis','Gesamt')
And continue with your code.
Upvotes: 2