Ed_in_NY

Reputation: 53

Extract instances of a patterned text sequence from a very long string using Python

I am working with a PDF document of about 80 pages. It lists all 1,984 senators in US history in chronological order. I have extracted the text of the document using PyPDF2, and the text is now assigned to a variable as a single, long string. Here is a segment:


Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835  281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827  282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829  283 November 27 McKinley, John (J-AL) March 3, 1831  284   (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831   (First served 1816-1823)  * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829  285   TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  March 4 Barnard, Isaac D. (J-PA) December 6, 1831  286  Ellis, Powhatan (J-MS) July 16, 1832   (First served 1825-1826)  Foot, Samuel A. (Adams/AJ-CT) March 3, 1833  287  McLane, Louis (J-DE) April 16, 1829  288  Parris, Albion K. (J-ME) August 26, 1828  289  Tyler, John (J/AJ-VA) February 29, 1836  290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841  291   (Served again 1845)  * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829  292 Start of Initial    Senate Service Name/Party   End of Service Rank    15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831  293 December 15 Iredell, James (J-NC) March 3, 1831  294  * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833  295  Holmes, John (Adams/AJ-ME) March 3, 1833   (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833   (First served 1817-1829)

Notice that the name, party affiliation, state, end of service date, and rank of each senator normally appear in a patterned segment. Here are some examples:


Rodney, Daniel (Adams-DE) January 12, 1827 282  
Bateman, Ephraim (Adams-NJ) January 12, 1829 283  
Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293

But there are also some exceptions, such as these:


Smith, William (R-SC) March 3, 1831 (First served 1816-1823)  
Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827)

In these cases no rank appears, because the rank was given when the senator was first listed.

My question is, how can I extract the basic information on each senator (name, party, state, end of service, rank)? I believe I need to loop through the string, finding all instances of a regular expression that captures the patterns, and assign each instance to a list within a list. The end result would be a list of lists that I could transform into a dataframe in pandas.

Upvotes: 0

Views: 70

Answers (1)

Alexandre B.

Reputation: 5502

You can try the following approach:

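import re
import pandas as pd

# d is the full text extracted from the PDF, as a single string
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]

df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
# regex=True is required in pandas >= 2.0, where str.replace is literal by default
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)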

Workflow:

  1. Split the input on each "Last, First" name pair:

    1. Use the regex [a-zA-Z]+\,\s+[a-zA-Z]+
    2. Wrap the regex in parentheses (a capturing group) because the split keys (i.e. the names) need to be kept
    3. Apply the regex using re.split
    4. Remove the first element, which is empty here (it holds the text before the first name)
  2. At this point each record is divided into two elements: the name and the text that follows it. We need to join each pair of consecutive elements. The topic "Create a 2D list out of 1D list" covers this step.

  3. Now the content can be extracted from each row. Here, we use re.findall with the regex ^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$. There are 4 groups:

    • Group 1 selects everything up to the opening parenthesis: (.*?)\s+\(
    • Group 2 selects everything up to the closing parenthesis: (.*?)\)
    • Group 3 selects everything up to and including a year (i.e. 4 digits): (.*?\d{4})
    • Group 4 selects everything up to the end: (.*?)$

For a better understanding of the regex, I advise you to try an online regex tester such as regex101.com to visualize the results.

  4. Create the dataframe
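Steps 1 to 3 above can be checked on a short sample before running them on the whole document. This is a minimal sketch: the names sample, parts, rows, row_pattern and records are chosen here for illustration, and the sample is cut from the question's text.

```python
import re

# Short sample cut from the question's text (two records).
sample = ("Rodney, Daniel (Adams-DE) January 12, 1827  282 November 9 "
          "Bateman, Ephraim (Adams-NJ) January 12, 1829  283")

# Step 1: split on "Last, First"; the capturing group keeps the names
# in the result. The first element is the text before the first name.
parts = re.split(r"([a-zA-Z]+,\s+[a-zA-Z]+)", sample)[1:]

# Step 2: glue each name back onto the text that follows it.
rows = [", ".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]

# Step 3: extract the four groups from each row.
row_pattern = re.compile(r"^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$")
records = [row_pattern.match(row).groups() for row in rows]

for record in records:
    print(record)
```

Each element of records is a 4-tuple (name, party/state, date, remainder), which is the shape the dataframe step expects.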

As next steps, apply more specific cleaning and splitting to the dataset, such as removing the trailing comma from the name with:

df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
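The remaining cleaning (splitting party from state, parsing the date, pulling out the rank) could look roughly like the sketch below. It works on the raw 4-tuples from step 3 and uses only the standard library. The helper name clean_record and the output keys are assumptions made for illustration, and the rules are inferred from the sample text only, so they may need adjusting for the full document.

```python
import re
from datetime import datetime


def clean_record(name, party_state, date_text, rest):
    """Turn one raw 4-tuple from step 3 into a dict of cleaned fields."""
    # Remove the comma added by the join in step 2, whether it is
    # trailing ("Smith, William,") or sits before an initial ("Henry,  M.").
    name = re.sub(r",\s*$", "", name)
    name = re.sub(r",\s+(?=[A-Z]\.$)", " ", name)
    # Assumption: the state is whatever follows the last hyphen.
    party, _, state = party_state.rpartition("-")
    # Dates look like "March 3, 1835" (%B expects English month names).
    end_of_service = datetime.strptime(date_text, "%B %d, %Y").date()
    # Assumption: the rank, when present, is the leading number of `rest`.
    rank_match = re.match(r"\s*(\d+)", rest)
    rank = int(rank_match.group(1)) if rank_match else None
    return {"name": name, "party": party, "state": state,
            "end_of_service": end_of_service, "rank": rank}


example = clean_record("Ridgely, Henry,  M.", "J-DE", "March 3, 1829",
                       "  285   TWENTIETH CONGRESS March 4, 1827")
print(example)
```

A list of such dicts can be passed directly to pd.DataFrame. Records without a rank (the "First served" cases) get None.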

Code + illustration

# import module
import pandas as pd
import re


d = "Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835  281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827  282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829  283 November 27 McKinley, John (J-AL) March 3, 1831  284   (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831   (First served 1816-1823)  * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829  285   TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  March 4 Barnard, Isaac D. (J-PA) December 6, 1831  286  Ellis, Powhatan (J-MS) July 16, 1832   (First served 1825-1826)  Foot, Samuel A. (Adams/AJ-CT) March 3, 1833  287  McLane, Louis (J-DE) April 16, 1829  288  Parris, Albion K. (J-ME) August 26, 1828  289  Tyler, John (J/AJ-VA) February 29, 1836  290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841  291   (Served again 1845)  * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829  292 Start of Initial    Senate Service Name/Party   End of Service Rank    15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831  293 December 15 Iredell, James (J-NC) March 3, 1831  294  * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833  295  Holmes, John (Adams/AJ-ME) March 3, 1833   (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833   (First served 1817-1829)"

# Step 1
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
print(d)
# ['Silsbee, Nathaniel', ' (Adams/AJ-MA) March 3, 1835  281 November 8 ',
#  'Rodney, Daniel', ' (Adams-DE) January 12, 1827  282 November 9 ', 
#  'Bateman, Ephraim', ' (Adams-NJ) January 12, 1829  283 November 27 ', 
#  'McKinley, John', ' (J-AL) March 3, 1831  284   (Served again 1837) November 29 ', 
#  'Smith, William', ' (R-SC) March 3, 1831   (First served 1816-1823)  * * * 1827 * * * January 12 ',
#  'Ridgely, Henry', ' M. (J-DE) March 3, 1829  285   TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  March 4 ',
#  'Barnard, Isaac', ' D. (J-PA) December 6, 1831  286  ',
#  'Ellis, Powhatan', ' (J-MS) July 16, 1832   (First served 1825-1826)  ',
#  'Foot, Samuel', ' A. (Adams/AJ-CT) March 3, 1833  287  ',
#  'McLane, Louis', ' (J-DE) April 16, 1829  288  ',
#  'Parris, Albion', ' K. (J-ME) August 26, 1828  289  ',
#  'Tyler, John', ' (J/AJ-VA) February 29, 1836  290 December 17 ',
#  'Webster, Daniel', ' (Adams/AJ/W-MA) February 22, 1841  291   (Served again 1845)  * * * 1828 * * * November 7 ',
#  'Prince, Oliver', ' H. (J-GA) March 3, 1829  292 Start of Initial    Senate Service Name/Party   End of Service Rank    15 December 10 ',
#  'Burnet, Jacob', ' (Adams/AJ-OH) March 3, 1831  293 December 15 ',
#  'Iredell, James', ' (J-NC) March 3, 1831  294  * * * 1829 * * * January 15 ',
#  'Dudley, Charles', ' E. (J-NY) March 3, 1833  295  ',
#  'Holmes, John', ' (Adams/AJ-ME) March 3, 1833   (First served 1820-1827) January 30 ',
#  'Dickerson, Mahlon', ' (R/CR/J-NJ) March 3, 1833   (First served 1817-1829)']


# Step 2
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
print(d)
# ['Silsbee, Nathaniel,  (Adams/AJ-MA) March 3, 1835  281 November 8 ',
#  'Rodney, Daniel,  (Adams-DE) January 12, 1827  282 November 9 ',
#  'Bateman, Ephraim,  (Adams-NJ) January 12, 1829  283 November 27 ',
#  'McKinley, John,  (J-AL) March 3, 1831  284   (Served again 1837) November 29 ',
#  'Smith, William,  (R-SC) March 3, 1831   (First served 1816-1823)  * * * 1827 * * * January 12 ',
#  'Ridgely, Henry,  M. (J-DE) March 3, 1829  285   TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  March 4 ',
#  'Barnard, Isaac,  D. (J-PA) December 6, 1831  286  ',
#  'Ellis, Powhatan,  (J-MS) July 16, 1832   (First served 1825-1826)  ',
#  'Foot, Samuel,  A. (Adams/AJ-CT) March 3, 1833  287  ',
#  'McLane, Louis,  (J-DE) April 16, 1829  288  ',
#  'Parris, Albion,  K. (J-ME) August 26, 1828  289  ',
#  'Tyler, John,  (J/AJ-VA) February 29, 1836  290 December 17 ',
#  'Webster, Daniel,  (Adams/AJ/W-MA) February 22, 1841  291   (Served again 1845)  * * * 1828 * * * November 7 ',
#  'Prince, Oliver,  H. (J-GA) March 3, 1829  292 Start of Initial    Senate Service Name/Party   End of Service Rank    15 December 10 ',
#  'Burnet, Jacob,  (Adams/AJ-OH) March 3, 1831  293 December 15 ',
#  'Iredell, James,  (J-NC) March 3, 1831  294  * * * 1829 * * * January 15 ',
#  'Dudley, Charles,  E. (J-NY) March 3, 1833  295  ',
#  'Holmes, John,  (Adams/AJ-ME) March 3, 1833   (First served 1820-1827) January 30 ',
#  'Dickerson, Mahlon,  (R/CR/J-NJ) March 3, 1833   (First served 1817-1829)']

# Step 3
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
for row in d:
    print(row)
# ('Silsbee, Nathaniel,', 'Adams/AJ-MA', 'March 3, 1835', '  281 November 8 ')
# ('Rodney, Daniel,', 'Adams-DE', 'January 12, 1827', '  282 November 9 ')
# ('Bateman, Ephraim,', 'Adams-NJ', 'January 12, 1829', '  283 November 27 ')
# ('McKinley, John,', 'J-AL', 'March 3, 1831', '  284   (Served again 1837) November 29 ')
# ('Smith, William,', 'R-SC', 'March 3, 1831', '   (First served 1816-1823)  * * * 1827 * * * January 12 ')
# ('Ridgely, Henry,  M.', 'J-DE', 'March 3, 1829', '  285   TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  March 4 ')
# ('Barnard, Isaac,  D.', 'J-PA', 'December 6, 1831', '  286  ')
# ('Ellis, Powhatan,', 'J-MS', 'July 16, 1832', '   (First served 1825-1826)  ')
# ('Foot, Samuel,  A.', 'Adams/AJ-CT', 'March 3, 1833', '  287  ')
# ('McLane, Louis,', 'J-DE', 'April 16, 1829', '  288  ')
# ('Parris, Albion,  K.', 'J-ME', 'August 26, 1828', '  289  ')
# ('Tyler, John,', 'J/AJ-VA', 'February 29, 1836', '  290 December 17 ')
# ('Webster, Daniel,', 'Adams/AJ/W-MA', 'February 22, 1841', '  291   (Served again 1845)  * * * 1828 * * * November 7 ')
# ('Prince, Oliver,  H.', 'J-GA', 'March 3, 1829', '  292 Start of Initial    Senate Service Name/Party   End of Service Rank    15 December 10 ')
# ('Burnet, Jacob,', 'Adams/AJ-OH', 'March 3, 1831', '  293 December 15 ')
# ('Iredell, James,', 'J-NC', 'March 3, 1831', '  294  * * * 1829 * * * January 15 ')
# ('Dudley, Charles,  E.', 'J-NY', 'March 3, 1833', '  295  ')
# ('Holmes, John,', 'Adams/AJ-ME', 'March 3, 1833', '   (First served 1820-1827) January 30 ')
# ('Dickerson, Mahlon,', 'R/CR/J-NJ', 'March 3, 1833', '   (First served 1817-1829)')


# Step 4
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)  # regex=True is needed in pandas >= 2.0

print(df)
#                     name    party/state    date_to_convert                                      rank_to_clean
# 0     Silsbee, Nathaniel    Adams/AJ-MA      March 3, 1835                                    281 November 8 
# 1         Rodney, Daniel       Adams-DE   January 12, 1827                                    282 November 9 
# 2       Bateman, Ephraim       Adams-NJ   January 12, 1829                                   283 November 27 
# 3         McKinley, John           J-AL      March 3, 1831             284   (Served again 1837) November 29 
# 4         Smith, William           R-SC      March 3, 1831     (First served 1816-1823)  * * * 1827 * * * ...
# 5    Ridgely, Henry,  M.           J-DE      March 3, 1829    285   TWENTIETH CONGRESS March 4, 1827, TO M...
# 6    Barnard, Isaac,  D.           J-PA   December 6, 1831                                              286  
# 7        Ellis, Powhatan           J-MS      July 16, 1832                         (First served 1825-1826)  
# 8      Foot, Samuel,  A.    Adams/AJ-CT      March 3, 1833                                              287  
# 9          McLane, Louis           J-DE     April 16, 1829                                              288  
# 10   Parris, Albion,  K.           J-ME    August 26, 1828                                              289  
# 11           Tyler, John        J/AJ-VA  February 29, 1836                                   290 December 17 
# 12       Webster, Daniel  Adams/AJ/W-MA  February 22, 1841    291   (Served again 1845)  * * * 1828 * * * ...
# 13   Prince, Oliver,  H.           J-GA      March 3, 1829    292 Start of Initial    Senate Service Name/...
# 14         Burnet, Jacob    Adams/AJ-OH      March 3, 1831                                   293 December 15 
# 15        Iredell, James           J-NC      March 3, 1831                  294  * * * 1829 * * * January 15 
# 16  Dudley, Charles,  E.           J-NY      March 3, 1833                                              295  
# 17          Holmes, John    Adams/AJ-ME      March 3, 1833               (First served 1820-1827) January 30 
# 18     Dickerson, Mahlon      R/CR/J-NJ      March 3, 1833                           (First served 1817-1829)
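
As a variation on the approach above (not the same method), a single pattern used with re.finditer can return the list of lists the question asks for in one pass, including the middle initial and an optional rank. This is only a sketch: the pattern is an assumption based on the sample text, it expects a two-letter state code after the last hyphen, and names containing hyphens, apostrophes or suffixes would need a looser name group.

```python
import re

pattern = re.compile(
    r"(?P<name>[A-Z][A-Za-z]+,\s+[A-Z][A-Za-z]+(?:\s+[A-Z]\.)?)\s+"  # "Last, First" + optional initial
    r"\((?P<party>[^()]*)-(?P<state>[A-Z]{2})\)\s+"                    # "(Party-ST)"
    r"(?P<end>[A-Z][a-z]+\s+\d{1,2},\s+\d{4})"                         # "March 3, 1835"
    r"(?:\s+(?P<rank>\d+)\b)?"                                          # rank, if one follows
)

# Sample cut from the question's text (three records, one without a rank).
text = ("November 29 Smith, William (R-SC) March 3, 1831   (First served 1816-1823)  "
        "* * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829  285   "
        "TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829  "
        "March 4 Barnard, Isaac D. (J-PA) December 6, 1831  286")

# One inner list per senator: [name, party, state, end of service, rank].
rows = [[m["name"], m["party"], m["state"], m["end"], m["rank"]]
        for m in pattern.finditer(text)]

for row in rows:
    print(row)
```

The rank is None for records where only a "First served" note follows the date. The resulting rows can be passed to pd.DataFrame with a matching columns list.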

Upvotes: 1
