How to use regular expressions to find a value in a pandas dataframe column

Question

I have a PDF document reporting COVID-19 numbers for counties in a state in a table in the document. I am reading the table into a pandas dataframe using camelot, and I am extracting the values in various rows based on the value, the county's name, in the first column. For that, I am using Boolean indexing as described here: How do I sum values in a column that match a given condition using pandas?

I am using the data extracted to report on the COVID-19 statistics in a subset of counties listed in the report that are of interest to my organization. I am also extracting the total numbers for the state, but the producers of the PDF cannot decide if they want to call that row of data "Gesamt" ("Total") or "Gesamtergebnis" ("Total result"). The dataframe I am working with, after camelot extracts the table from the PDF, looks like this:

                0        1       2        3
...
9        A County   13.789   (+22)  1.566,0
10      My County   16.581   (+45)  3.040,0
11   Their County    7.445   (+15)  2.821,6
... 
55         Gesamt  304.950  (+820)  2.747,2

The code below works, if they use "Gesamt." I would like to write it so that it will also work if they use "Gesamtergebnis." I cannot rely on the total ("Gesamt" or "Gesamtergebnis") always being in the same row.

# Open LGA reports for yesterday and the day before
# TO DO: Sometimes the LGA report is named COVID_Lagebericht_LGA_yymmdd.pdf or it ends in _01
#        Add in a try/else statement to compensate for this
rptyes = f'Reports_LGA/{yday_yymmdd}_COVID_Tagesbericht_LGA.pdf'
rptdbf = f'Reports_LGA/{daybef_yymmdd}_COVID_Tagesbericht_LGA.pdf'

# Read the LGA reports into dataframes.
dfyes = camelot.read_pdf(rptyes, pages='2', flavor='stream')
dfdbf = camelot.read_pdf(rptdbf, pages='2', flavor='stream')

# Extract the statewide 7-D-I
# TO DO: Sometimes the last line says "Gesamt", sometimes "Gesamtergebnis" or something else.
#        Add in some sort of error checking or try/else statement or regular expression to compensate
landindexyes = lambda land: dfyes[0].df.loc[dfyes[0].df[0] == land].index[0]
landindexdbf = lambda land: dfdbf[0].df.loc[dfdbf[0].df[0] == land].index[0]
land = 'Gesamt'
bwname = 'Baden-Württemberg'
bwcases = int(dfyes[0].df.loc[landindexyes(land), 1].replace('.',''))
bwcasesdiff = dfyes[0].df.loc[landindexyes(land), 2]
bwdeaths = int(dfyes[0].df.loc[landindexyes(land), 4].replace('.',''))
bwdeathsdiff = dfyes[0].df.loc[landindexyes(land), 5]
bw7diyes = float(dfyes[0].df.loc[landindexyes(land), 7].replace(',','.'))
bw7didbf = float(dfdbf[0].df.loc[landindexdbf(land), 7].replace(',','.'))
bw7didiff = bw7diyes - bw7didbf
rptrowsbw = [bwname, bwcases, bwcasesdiff, bwdeaths, bwdeathsdiff, bw7diyes, bw7didbf]

How can I use a regular expression to match either "Gesamt" or "Gesamtergebnis" in the variable passed to the lambda expressions 'landindexyes' and 'landindexdbf'?

If regular expressions are not the way to go, I am open to other suggestions. I am thinking that an if/else might work, but I don't think that would be as elegant.

How to use regular expressions to find a value in a pandas dataframe column

Answers (1)

Related Questions