Sankalp Sharma
Sankalp Sharma

Reputation: 45

Splitting column containing state and region names in python

I am trying to create two separate columns from the following column in a data frame.

0                         State_1
1                            Auburn
2                          Florence
3                      Jacksonville
4                        Livingston
5                        Montevallo
6                              Troy
7                        Tuscaloosa
8                          Tuskegee
9                            state_2
10                        Fairbanks
11                          state_3
12                        Flagstaff
13                            Tempe
14                           Tucson
15                         state_4
16                      Arkadelphia
17                           Conway
18                     Fayetteville
19                        Jonesboro
20                         Magnolia
21                       Monticello
22                     Russellville
23                           Searcy

I want the above df to look something like this:

0  state_1                 Auburn
2  state_1                 Florence
3  state_1                 Jacksonville
4  state_1                Livingston
5  state_1                Montevallo
6  state_1                Troy
7  state_1                Tuscaloosa
8  state_1                Tuskegee
 ...

16 state_4                   Arkadelphia
17 state_4                   Conway
18 state_4                   Fayetteville
19 state_4                   Jonesboro
20 state_4                   Magnolia
21 state_4                   Monticello
22 state_4                   Russellville
23 v                         Searcy

As you can see, I want to sort of reverse pivot the data. I looked up the documentation on pd.pivot, but couldn't make any headway. Here is a dictionary of states:

states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}

Here is a code I tried. Be warned, this is an embarrassingly bad attempt (pretty much a novice in Python here).

#create new column for states only
df['State'] = 0

#Duplicate above combined column
df['Column_duplicate'] = df['Column']

for i in range(len(df)):
    if (dfl['Column_duplicate'].iloc[i+1] == df['Column'].iloc[i]):
           dfl['State'].iloc[i] = dfl['Column'].iloc[i]

Upvotes: 0

Views: 363

Answers (2)

Sankalp Sharma
Sankalp Sharma

Reputation: 45

dfl = (pd.read_csv('university_towns.txt', sep="[|]|(|)", header=None).rename(columns={0:'datamain'}))
    
dfl = dfl['datamain'].str.split("(", n = 1, expand = True)
dfl = dfl.loc[:,[0]].rename(columns={0:'State'})
dfl['RegionName'] = dfl['State'].str.strip()
dfl['State'] = dfl['State'].str.replace(r"[.*\]","").str.strip()
dfl['RN1'] = dfl['RegionName'].str.contains(r"\[.*\]","")
    
    for i in range(len(dfl)):
        if dfl['RN1'].iloc[i] != True:
            dfl['State'].iloc[i] = np.NaN
            
    dfl = dfl.ffill(axis = 0)
    df1

Data from here: https://en.wikipedia.org/wiki/List_of_college_towns#College_towns_in_the_United_States

Please note this I'm sure is a rather arduous way of doing this. In summary: ffill() function is what I wanted to create the state column.

Upvotes: 0

n1colas.m
n1colas.m

Reputation: 3989

You can mask the rows containing state_ using where, and then use ffill() to populate the new colum with those values. After that, remove all rows with state_ on both columns.

import pandas as pd

df = pd.read_csv("data.txt", header=None)
print(df)

mark = df[0].where(df[0].str.contains("state_", case=False))
df[1] = mark.ffill()
df = df[df.iloc[:, 0] != df.iloc[:, 1]]

df.columns = ['State', 'StateNum']
df = df[df.columns[::-1]].reset_index(drop=True)

print(df)

Output from df

   StateNum         State
0   State_1        Auburn
1   State_1      Florence
2   State_1  Jacksonville
3   State_1    Livingston
4   State_1    Montevallo
5   State_1          Troy
6   State_1    Tuscaloosa
7   State_1      Tuskegee
8   state_2     Fairbanks
9   state_3     Flagstaff
10  state_3         Tempe
11  state_3        Tucson
12  state_4   Arkadelphia
13  state_4        Conway
14  state_4  Fayetteville
15  state_4     Jonesboro
16  state_4      Magnolia
17  state_4    Monticello
18  state_4  Russellville
19  state_4         Searc

Upvotes: 1

Related Questions