Reputation: 53
I am working with this PDF document of about 80 pages. It lists all 1,984 US senators from US history in chronological order. I have extracted the text of the document using PyPDF2. The text is now assigned to a variable as a single, long string. Here is a segment:
Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)
Notice that the name, party affiliation, state, end of service date, and rank of each senator normally appear in a patterned segment. Here are some examples:
Rodney, Daniel (Adams-DE) January 12, 1827 282
Bateman, Ephraim (Adams-NJ) January 12, 1829 283
Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293
But there are also some exceptions, such as these:
Smith, William (R-SC) March 3, 1831 (First served 1816-1823)
Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827)
In these cases the rank is given when the senator is first listed.
My question is, how can I extract the basic information on each senator (name, party, state, end of service, rank)? I believe I need to loop through the string, finding all instances of a regular expression that captures the patterns, and assign each instance to a list within a list. The end result would be a list of lists that I could transform into a dataframe in pandas.
Upvotes: 0
Views: 70
Reputation: 5502
You can try the following approach:
import re
import pandas as pd

# d holds the text extracted from the PDF as one long string
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
Workflow:
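Split the input with re.split on the pattern [a-zA-Z]+\,\s+[a-zA-Z]+, that is, a comma surrounded by two names (last name, first name). Because the pattern is wrapped in a capture group, re.split keeps the matched names in the result, and [1:] drops whatever precedes the first name.
At this point every record is present but divided into two elements: the name and the rest of the entry. We need to join each pair of consecutive elements. The topic Create a 2D list out of 1D list covers this step.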
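To make the split-and-pair step concrete, here is a minimal sketch on two entries taken from the segment in the question. It pairs the pieces with zip and concatenates them directly, a small variation on the ", ".join used in this answer, so no extra comma is inserted after the first name. The variable names (sample, parts, names, rests, rows) are only illustrative.

```python
import re

# Two entries copied from the segment in the question.
sample = ("Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 "
          "Bateman, Ephraim (Adams-NJ) January 12, 1829 283")

# The capture group makes re.split keep the matched names in the output;
# [1:] drops the (here empty) text before the first name.
parts = re.split(r"([a-zA-Z]+,\s+[a-zA-Z]+)", sample)[1:]

# Names sit at even indices, the text following each name at odd indices.
names, rests = parts[0::2], parts[1::2]

# Re-attach each name to the text that followed it.
rows = [name + rest for name, rest in zip(names, rests)]

for row in rows:
    print(repr(row))
```

Because the second element of each pair already starts with a space, plain concatenation restores the original spacing of the entry.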
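Now the content can be extracted from each row. Here, we use re.findall with the regex ^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$. There are 4 groups:
(.*?)\s+\( captures the name, up to the whitespace before the opening parenthesis
(.*?)\) captures the party and state inside the parentheses
(.*?\d{4}) captures the end-of-service date, up to the first four-digit year
(.*?)$ captures the remainder, which holds the rank (when present) and any trailing text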
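As an illustration of what each group captures, the sketch below applies the same regex to one of the joined rows, with group names added for readability. The group names (name, party_state, end_date, rest) are my own labels, not part of the original pattern.

```python
import re

# Same four groups as above, with (hypothetical) names attached.
ROW = re.compile(
    r"^(?P<name>.*?)\s+\((?P<party_state>.*?)\)\s+"
    r"(?P<end_date>.*?\d{4})(?P<rest>.*?)$"
)

# One row as produced by the join step of this answer.
row = ("Ridgely, Henry, M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS "
       "March 4, 1827, TO MARCH 3, 1829 March 4 ")

m = ROW.match(row)
fields = m.groupdict()

for key, value in fields.items():
    print(f"{key!r}: {value!r}")
```

The last group still carries the rank mixed with headings and the next start date, which is why that column needs further cleaning.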
For a better understanding of the regex, I advise you to try an online regex tester such as regex101.com to visualize the results.
Next, apply more specific cleaning and splitting to the dataset, such as removing the trailing comma from the name with:
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
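The remaining columns could be cleaned along the same lines. The following is a rough sketch under a few assumptions: the state code always follows the last hyphen in the parentheses, dates always look like "March 3, 1835", and the rank, when present, is the first number in the leftover text. The new column names (party, state, end_of_service, rank) are only suggestions.

```python
import pandas as pd

# Two rows in the shape produced by this answer (name already cleaned).
df = pd.DataFrame(
    [
        ("Silsbee, Nathaniel", "Adams/AJ-MA", "March 3, 1835", " 281 November 8 "),
        ("Smith, William", "R-SC", "March 3, 1831",
         " (First served 1816-1823) * * * 1827 * * * January 12 "),
    ],
    columns=["name", "party/state", "date_to_convert", "rank_to_clean"],
)

# Split on the last hyphen: the state code is assumed to come last.
df[["party", "state"]] = df["party/state"].str.rsplit("-", n=1, expand=True)

# Parse the end-of-service date; values that do not fit the format become NaT.
df["end_of_service"] = pd.to_datetime(
    df["date_to_convert"], format="%B %d, %Y", errors="coerce"
)

# Keep only a leading integer as the rank; rows without one get <NA>.
rank_text = df["rank_to_clean"].str.extract(r"^\s*(\d+)", expand=False)
df["rank"] = pd.to_numeric(rank_text).astype("Int64")

print(df[["name", "party", "state", "end_of_service", "rank"]])
```

Using the nullable Int64 dtype keeps the rank as an integer column even though repeat listings such as "Smith, William" have no rank.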
Code + illustration
# import module
import pandas as pd
import re
d = "Silsbee, Nathaniel (Adams/AJ-MA) March 3, 1835 281 November 8 Rodney, Daniel (Adams-DE) January 12, 1827 282 November 9 Bateman, Ephraim (Adams-NJ) January 12, 1829 283 November 27 McKinley, John (J-AL) March 3, 1831 284 (Served again 1837) November 29 Smith, William (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 Barnard, Isaac D. (J-PA) December 6, 1831 286 Ellis, Powhatan (J-MS) July 16, 1832 (First served 1825-1826) Foot, Samuel A. (Adams/AJ-CT) March 3, 1833 287 McLane, Louis (J-DE) April 16, 1829 288 Parris, Albion K. (J-ME) August 26, 1828 289 Tyler, John (J/AJ-VA) February 29, 1836 290 December 17 Webster, Daniel (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 Prince, Oliver H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 Burnet, Jacob (Adams/AJ-OH) March 3, 1831 293 December 15 Iredell, James (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 Dudley, Charles E. (J-NY) March 3, 1833 295 Holmes, John (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 Dickerson, Mahlon (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)"
# Step 1
d = re.split(r"([a-zA-Z]+\,\s+[a-zA-Z]+)", d)[1:]
print(d)
# ['Silsbee, Nathaniel', ' (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel', ' (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim', ' (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John', ' (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William', ' (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry', ' M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac', ' D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan', ' (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel', ' A. (Adams/AJ-CT) March 3, 1833 287 ',
# 'McLane, Louis', ' (J-DE) April 16, 1829 288 ',
# 'Parris, Albion', ' K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John', ' (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel', ' (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver', ' H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob', ' (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James', ' (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles', ' E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John', ' (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon', ' (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 2
d = [", ".join(d[i:i+2]) for i in range(0, len(d), 2)]
print(d)
# ['Silsbee, Nathaniel, (Adams/AJ-MA) March 3, 1835 281 November 8 ',
# 'Rodney, Daniel, (Adams-DE) January 12, 1827 282 November 9 ',
# 'Bateman, Ephraim, (Adams-NJ) January 12, 1829 283 November 27 ',
# 'McKinley, John, (J-AL) March 3, 1831 284 (Served again 1837) November 29 ',
# 'Smith, William, (R-SC) March 3, 1831 (First served 1816-1823) * * * 1827 * * * January 12 ',
# 'Ridgely, Henry, M. (J-DE) March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ',
# 'Barnard, Isaac, D. (J-PA) December 6, 1831 286 ',
# 'Ellis, Powhatan, (J-MS) July 16, 1832 (First served 1825-1826) ',
# 'Foot, Samuel, A. (Adams/AJ-CT) March 3, 1833 287 ',
# 'McLane, Louis, (J-DE) April 16, 1829 288 ',
# 'Parris, Albion, K. (J-ME) August 26, 1828 289 ',
# 'Tyler, John, (J/AJ-VA) February 29, 1836 290 December 17 ',
# 'Webster, Daniel, (Adams/AJ/W-MA) February 22, 1841 291 (Served again 1845) * * * 1828 * * * November 7 ',
# 'Prince, Oliver, H. (J-GA) March 3, 1829 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ',
# 'Burnet, Jacob, (Adams/AJ-OH) March 3, 1831 293 December 15 ',
# 'Iredell, James, (J-NC) March 3, 1831 294 * * * 1829 * * * January 15 ',
# 'Dudley, Charles, E. (J-NY) March 3, 1833 295 ',
# 'Holmes, John, (Adams/AJ-ME) March 3, 1833 (First served 1820-1827) January 30 ',
# 'Dickerson, Mahlon, (R/CR/J-NJ) March 3, 1833 (First served 1817-1829)']
# Step 3
d = [re.findall(r'^(.*?)\s+\((.*?)\)\s+(.*?\d{4})(.*?)$', _)[0] for _ in d]
for row in d:
    print(row)
# ('Silsbee, Nathaniel,', 'Adams/AJ-MA', 'March 3, 1835', ' 281 November 8 ')
# ('Rodney, Daniel,', 'Adams-DE', 'January 12, 1827', ' 282 November 9 ')
# ('Bateman, Ephraim,', 'Adams-NJ', 'January 12, 1829', ' 283 November 27 ')
# ('McKinley, John,', 'J-AL', 'March 3, 1831', ' 284 (Served again 1837) November 29 ')
# ('Smith, William,', 'R-SC', 'March 3, 1831', ' (First served 1816-1823) * * * 1827 * * * January 12 ')
# ('Ridgely, Henry, M.', 'J-DE', 'March 3, 1829', ' 285 TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 ')
# ('Barnard, Isaac, D.', 'J-PA', 'December 6, 1831', ' 286 ')
# ('Ellis, Powhatan,', 'J-MS', 'July 16, 1832', ' (First served 1825-1826) ')
# ('Foot, Samuel, A.', 'Adams/AJ-CT', 'March 3, 1833', ' 287 ')
# ('McLane, Louis,', 'J-DE', 'April 16, 1829', ' 288 ')
# ('Parris, Albion, K.', 'J-ME', 'August 26, 1828', ' 289 ')
# ('Tyler, John,', 'J/AJ-VA', 'February 29, 1836', ' 290 December 17 ')
# ('Webster, Daniel,', 'Adams/AJ/W-MA', 'February 22, 1841', ' 291 (Served again 1845) * * * 1828 * * * November 7 ')
# ('Prince, Oliver, H.', 'J-GA', 'March 3, 1829', ' 292 Start of Initial Senate Service Name/Party End of Service Rank 15 December 10 ')
# ('Burnet, Jacob,', 'Adams/AJ-OH', 'March 3, 1831', ' 293 December 15 ')
# ('Iredell, James,', 'J-NC', 'March 3, 1831', ' 294 * * * 1829 * * * January 15 ')
# ('Dudley, Charles, E.', 'J-NY', 'March 3, 1833', ' 295 ')
# ('Holmes, John,', 'Adams/AJ-ME', 'March 3, 1833', ' (First served 1820-1827) January 30 ')
# ('Dickerson, Mahlon,', 'R/CR/J-NJ', 'March 3, 1833', ' (First served 1817-1829)')
# Step 4
df = pd.DataFrame(d, columns=["name", "party/state", "date_to_convert", "rank_to_clean"])
df["name"] = df["name"].str.replace(r'\,$', '', regex=True)
print(df)
# name party/state date_to_convert rank_to_clean
# 0 Silsbee, Nathaniel Adams/AJ-MA March 3, 1835 281 November 8
# 1 Rodney, Daniel Adams-DE January 12, 1827 282 November 9
# 2 Bateman, Ephraim Adams-NJ January 12, 1829 283 November 27
# 3 McKinley, John J-AL March 3, 1831 284 (Served again 1837) November 29
# 4 Smith, William R-SC March 3, 1831 (First served 1816-1823) * * * 1827 * * * ...
# 5 Ridgely, Henry, M. J-DE March 3, 1829 285 TWENTIETH CONGRESS March 4, 1827, TO M...
# 6 Barnard, Isaac, D. J-PA December 6, 1831 286
# 7 Ellis, Powhatan J-MS July 16, 1832 (First served 1825-1826)
# 8 Foot, Samuel, A. Adams/AJ-CT March 3, 1833 287
# 9 McLane, Louis J-DE April 16, 1829 288
# 10 Parris, Albion, K. J-ME August 26, 1828 289
# 11 Tyler, John J/AJ-VA February 29, 1836 290 December 17
# 12 Webster, Daniel Adams/AJ/W-MA February 22, 1841 291 (Served again 1845) * * * 1828 * * * ...
# 13 Prince, Oliver, H. J-GA March 3, 1829 292 Start of Initial Senate Service Name/...
# 14 Burnet, Jacob Adams/AJ-OH March 3, 1831 293 December 15
# 15 Iredell, James J-NC March 3, 1831 294 * * * 1829 * * * January 15
# 16 Dudley, Charles, E. J-NY March 3, 1833 295
# 17 Holmes, John Adams/AJ-ME March 3, 1833 (First served 1820-1827) January 30
# 18 Dickerson, Mahlon R/CR/J-NJ March 3, 1833 (First served 1817-1829)
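For comparison, here is a different technique from the split-based one above: a single pass with re.finditer, closer to the "find all instances of a regular expression" idea in the question. It is only a sketch and assumes names of the form "Last, First" with an optional middle initial, a two-letter state code after the last hyphen, and a rank that immediately follows the date when present. Names with hyphens, apostrophes or suffixes would need a looser pattern. ENTRY and its group names are hypothetical.

```python
import re

ENTRY = re.compile(
    r"(?P<name>[A-Za-z]+, [A-Za-z]+(?: [A-Z]\.)?)"  # "Last, First" + optional initial
    r" \((?P<party>[^()]*)-(?P<state>[A-Z]{2})\)"   # "(Party-ST)"
    r" (?P<end>[A-Z][a-z]+ \d{1,2}, \d{4})"         # "Month D, YYYY"
    r"(?: (?P<rank>\d+))?"                          # rank, absent on repeat listings
)

# Three consecutive entries copied from the segment in the question.
text = ("Smith, William (R-SC) March 3, 1831 (First served 1816-1823) "
        "* * * 1827 * * * January 12 Ridgely, Henry M. (J-DE) March 3, 1829 285 "
        "TWENTIETH CONGRESS March 4, 1827, TO MARCH 3, 1829 March 4 "
        "Barnard, Isaac D. (J-PA) December 6, 1831 286")

# One dict per senator; groups that did not match (no rank) are None.
rows = [m.groupdict() for m in ENTRY.finditer(text)]

for r in rows:
    print(r)
```

A list of dicts like this can be passed straight to pd.DataFrame, and because the pattern only matches the entry itself, headings and start dates between entries are skipped rather than captured.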
Upvotes: 1