Reputation: 37
I am trying to build a data frame out of a .txt file. The .txt file (which is a parameter I use in another program) looks like this:
a('s1','0')=1;
a('s2','0')=1;
a('s2','18')=1;
a('s3','0')=1;
a('s3','19')=1;
a('s3','21')=1;
a('s4','0')=1;
a('s4','20')=1;
Where s1, s2, s3... are routes, so route s1, route s2, route s3... for flights. And the following, '0', '18', '19' are id's of flights. So basically, route s1 is just flight '0', but route s3 goes first flight '0' followed by flight '19' and finally flight '21'.
What I want to extract from this txt file is a dataframe containing the route (s1, s2, s...) and the first and last flight operated in that route. So if route s10 has 5 flights, I only want the first and last flight of the route.
Is there a way in python or maybe pandas?
Upvotes: 0
Views: 127
Reputation: 5372
It looks like an easy task for regex:
import re
import pandas as pd
extract_route = re.compile(r"a\('(\w+)','(\d+)'\)=1;")
with open('data.txt', 'r') as f:
routes = f.readlines()
df = pd.DataFrame([extract_route.split(route)[1:3] for route in routes],
columns=['Route', 'FlightID'])
Complementing the answer, you can now group the series by 'Route':
groups = df.groupby('Route')
With the groups in place, you can get the first and last flights using:
>>> groups.first()
FlightID
Route
s1 0
s2 0
s3 0
s4 0
>>> groups.last()
FlightID
Route
s1 0
s2 18
s3 21
s4 20
>>>
The main advantage of this approach is that you do not lose the intermediate flights information, in case you need them in the future.
Upvotes: 1
Reputation: 8521
Can you try the following:
import pandas as pd
with open('data.txt', 'r') as infile:
data = infile.read()
df = pd.DataFrame([i[i.find("(")+1:i.find(")")].replace("'", '').split(',')
for i in data.split('\n')], columns=['route', 'flight_id'])
print(df)
Output:
route flight_id
0 s1 0
1 s2 0
2 s2 18
3 s3 0
4 s3 19
5 s4 0
6 s4 20
Upvotes: 0