Reputation: 38
So, I have a text file I tried read_csv
but the file structure doesn't allow for read_csv
to give desired output. I have been recommended to use open()
function. instead, but have been unable to successfully apply it yet!.
Text file : Textfile is from wikipedia of this structure
Now, I want a dataframe with two columns State and RegionName, but as they are vertically aligned I am not able to get desired output.
My Output: Just one column is filled and rest is NaN
Desired output is of the form:
{('Alabama','Auburn'),('Alabama','Florence'),('Alabama','Jacksonville') so on}
Upvotes: 0
Views: 285
Reputation: 478
Assuming all of your data is structured the same as in the photo provided, this should work:
import pandas as pd
df = pd.DataFrame(columns=["State", "RegionName"])
with open("example.txt") as f:
content = f.read()
lines = content.splitlines()
state_split_str = "[edit]"
region_split_str = "("
for line in lines:
if state_split_str in line:
state = line.split(state_split_str)[0].strip()
else:
region = line.split(region_split_str)[0].strip()
df = df.append({"State": state, "RegionName": region}, ignore_index=True)
print(df)
The above would result in:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alaska Fairbanks
This works because we know all schools in a state are preceded by a line containing the name of the state with "[edit]" beside it. We simply check to see if "[edit]" exists in the string. If it does, we split on "[edit]" and take everything to the left (the state name) via [0]
. For good measure we trim the new string of any leading/trailing whitespace. This provides us with our State. As we continue to iterate through each line, we know that if a line does not contain "[edit]" it is a school belonging to the current value of our "state" variable.
We then simply perform a similar split on the opening parenthesis of each string, and again take the first item of the list returned by split to obtain the RegionName, again trimming whitespace. Since we now have both our State and RegionName values, we can append them to our dataframe.
Is this an ideal solution? Not exactly, because we're hardcoding values into our code to split on/column names for appending, but due to the format of your data, I'm not sure there is a better solution.
Upvotes: 3