Himanshu agarwal
Himanshu agarwal

Reputation: 38

Convert textfile into dataframe, but text file structure is messy!( In Python)

So, I have a text file I tried read_csv but the file structure doesn't allow for read_csv to give desired output. I have been recommended to use open() function. instead, but have been unable to successfully apply it yet!.

Text file : Textfile is from wikipedia of this structure

Now, I want a dataframe with two columns State and RegionName, but as they are vertically aligned I am not able to get desired output.

My Output: Just one column is filled and rest is NaN

Desired output is of the form:

{('Alabama','Auburn'),('Alabama','Florence'),('Alabama','Jacksonville') so on}

Upvotes: 0

Views: 285

Answers (1)

Rexovas
Rexovas

Reputation: 478

Assuming all of your data is structured the same as in the photo provided, this should work:

import pandas as pd


df = pd.DataFrame(columns=["State", "RegionName"])

with open("example.txt") as f:
    content = f.read()
    lines = content.splitlines()
    state_split_str = "[edit]"
    region_split_str = "("
    for line in lines:
        if state_split_str in line:
            state = line.split(state_split_str)[0].strip()
        else:
            region = line.split(region_split_str)[0].strip()
            df = df.append({"State": state, "RegionName": region}, ignore_index=True)

print(df)

The above would result in:

     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3   Alaska     Fairbanks

This works because we know all schools in a state are preceded by a line containing the name of the state with "[edit]" beside it. We simply check to see if "[edit]" exists in the string. If it does, we split on "[edit]" and take everything to the left (the state name) via [0]. For good measure we trim the new string of any leading/trailing whitespace. This provides us with our State. As we continue to iterate through each line, we know that if a line does not contain "[edit]" it is a school belonging to the current value of our "state" variable.

We then simply perform a similar split on the opening parenthesis of each string, and again take the first item of the list returned by split to obtain the RegionName, again trimming whitespace. Since we now have both our State and RegionName values, we can append them to our dataframe.

Is this an ideal solution? Not exactly, because we're hardcoding values into our code to split on/column names for appending, but due to the format of your data, I'm not sure there is a better solution.

Upvotes: 3

Related Questions