Reputation: 21
I'm working on a Python project that reads a text file listing all 50 states plus DC. The file is delimited by a semicolon followed by a newline, so each state sits on its own line ending in a semicolon (;). An example is below. I also read in a second file containing a LOT of information. The text document can be found here.
I want to skip every line that starts with a state name, along with the line directly below it, because I do not need that information. Is there a way to test, line by line, whether a line starts with one of the fifty state names from the other text file and, if so, skip that line plus the line below it?
For example, in the hyperlinked text file, line 43 starts with Alaska. I want to skip that line and the line below it, and store the rest of the information in an array. At line 244, the information for the next state (Alabama) starts. I want to skip line 244 and the line below it, then do the same thing: store all the information in the array, compiling one large array at the end.
Here are the first four lines of the fifty states file:
Alabama;
Alaska;
Arizona;
Arkansas;
For clarification, the only information I am interested in is the ICAO data, which is the 3rd column in the hyperlinked text file.
Also, would it be an issue if there is no ICAO information for a specific line? For example, line 63 in the hyperlinked text document does not have a value.
This is the code I have so far:
import numpy as np

# This program reads in the ICAO data file found at: http://weather.rap.ucar.edu/surface/stations.txt
with open('ICAOlist.txt', 'r') as dataICAO:
    icaoData = np.loadtxt(dataICAO, dtype=str, delimiter=' ', skiprows=41)

with open('listOfAllStates.txt', 'r') as dataStates:
    statesData = np.loadtxt(dataStates, dtype=str, delimiter=';')
Upvotes: 2
Views: 58
Reputation: 21239
I'm pretty sure this is just a matter of breaking down your concerns. First, you want to load your 'states name file' only once:
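import numpy as np

# Get all the states as a 1-D array of names
def load_states(statesFile):
    with open(statesFile, 'r') as states:
        # The trailing ';' creates an empty second column; usecols=0 keeps only the names
        return np.loadtxt(states, dtype=str, delimiter=';', usecols=0)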
Now, we need to go through every line in the ICAO data:
def load_icao_data(state_filename, icao_filename):
    states = load_states(state_filename)
    with open(icao_filename, 'r') as icao_file:
        previous_line = None
        for line in icao_file:
            # Skip a state line and the line right after it
            if valid_line(line, states) and valid_line(previous_line, states):
                process_line(line)
            previous_line = line
The two functions you would have to write are valid_line (which should return a bool) and process_line (which should do whatever you need done with the data).
valid_line should take the current line along with the list of states. It would look something like this:
def valid_line(line, states):
    if not line:
        return True  # if the line is empty or None
    for state in states:
        if line.startswith(state):
            return False
    return True
process_line is left for you to determine. Make sense?
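As one possible starting point for process_line, here is a minimal sketch that pulls out the ICAO code by character position rather than by splitting on spaces, since station names can contain spaces and some lines have no ICAO value at all. The column range (characters 20-23, counting from 0) and the icao_codes list are assumptions for illustration, so check them against the header of your actual file. The sample lines are built by hand to match that assumed layout.

```python
# Assumed fixed-width layout: the ICAO code sits in characters 20-23 (0-based).
ICAO_SLICE = slice(20, 24)

icao_codes = []  # hypothetical accumulator for the extracted codes

def process_line(line):
    """Append the line's ICAO code to icao_codes, skipping blank entries."""
    code = line[ICAO_SLICE].strip()
    if code:  # empty when the column is blank or the line is too short
        icao_codes.append(code)

# Hand-built sample lines that follow the assumed layout
with_code = "AK " + "ADAK NAS".ljust(17) + "PADK" + "  ADK"
without_code = "AK " + "AKHIOK".ljust(17) + "    " + "  AKK"

process_line(with_code)
process_line(without_code)
print(icao_codes)  # ['PADK']
```

A line with no ICAO value simply contributes nothing, so it is not a problem. Once every line has been processed, np.array(icao_codes) would give you the single large array you are after.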
Addendum:
I note in your actual data that state isn't really the thing that determines if a line is 'bad'. You could rewrite valid_line to:
def valid_line(line):
    return (len(line) > 3           # Eliminates short/empty lines
            and line[0] != '!'      # Eliminates 'comment' lines
            and line[2] == ' '      # Eliminates 'state title' lines
            and line[3] != ' ')     # Eliminates 'header column' line
Then your load_icao_data becomes:
def load_icao_data(icao_filename):
    with open(icao_filename, 'r') as icao_file:
        for line in icao_file:
            if valid_line(line):
                process_line(line)
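To see how the character checks in the addendum sort lines, here is a small self-contained sketch that runs them over a few made-up lines held in memory. The sample text only imitates the shape of the real file (a comment line, a state title, a column header, a station row, and a blank line); it is not copied from it.

```python
import io

def valid_line(line):
    return (len(line) > 3           # Eliminates short/empty lines
            and line[0] != '!'      # Eliminates 'comment' lines
            and line[2] == ' '      # Eliminates 'state title' lines
            and line[3] != ' ')     # Eliminates 'header column' line

# Made-up lines that imitate the kinds of rows in the data file
sample = io.StringIO(
    "! comment line\n"
    "ALASKA\n"
    "CD  STATION         ICAO  IATA\n"
    "AK ADAK NAS         PADK  ADK\n"
    "\n"
)

kept = [line.rstrip("\n") for line in sample if valid_line(line)]
print(kept)  # ['AK ADAK NAS         PADK  ADK']
```

Only the station row survives, which is why the states file is no longer needed with this version.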
Upvotes: 1