Brandon Molyneaux

Reputation: 21

Comparing two text documents and skipping certain lines based off of one text document - Python

I'm working on a Python project. I read in a text file listing all 50 states plus DC, one per line, with each line terminated by a semicolon (;). An example is below. I also read in a second file with a LOT of information. The text document can be found here.

I want to skip any line that starts with the state name by testing it against a text file with all fifty states, along with the line below any such line. I do not need this information. Is there a way to test, line by line, if it starts with the state name and, if it matches with one of the fifty states in the other text file, skip that line plus the line below it?

For example, in the hyperlinked text file, line 43 starts with Alaska. I want to skip that line and the line below it. I want to store the rest of the information in an array. When I hit line 244, the information for the next state (Alabama) starts. I want to skip line 244 and the line below that, and do the same thing - store all the information in the array, compiling one large array at the end.

Here are the first four lines of the fifty states file:

Alabama; 
Alaska;
Arizona; 
Arkansas;

For clarification, the only information I am interested in is the ICAO data, which is the 3rd column in the hyperlinked text file.

Also, would it be an issue if there is no ICAO information for a specific line? For example, line 63 in the hyperlinked text document does not have a value.

This is the code I have so far:

import numpy as np
#This program reads in the ICAO data file found at: http://weather.rap.ucar.edu/surface/stations.txt

with open('ICAOlist.txt','r') as dataICAO:
     icaoData = np.loadtxt(dataICAO, dtype = str, delimiter = ' ', skiprows = 41)
     with open('listOfAllStates.txt', 'r') as dataStates:
         statesData = np.loadtxt(dataStates, dtype = str, delimiter = ';')

Upvotes: 2

Views: 58

Answers (1)

Nathaniel Ford

Reputation: 21239

I'm pretty sure this is just a matter of breaking down your concerns. First, you want to load your 'states name file' only once:

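# Get all the state names as a list of plain strings
def load_states(statesFile):
    with open(statesFile, 'r') as states:
        # Each line looks like "Alabama; ": drop the trailing semicolon and whitespace
        names = [line.strip().rstrip(';').strip() for line in states]
        return [name for name in names if name]  # skip blank lines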

Now, we need to go through every line in the ICAO data:

def load_icao_data(state_filename, icao_filename):
    states = load_states(state_filename)
    with open(icao_filename, 'r') as icao_file:  # avoid shadowing the built-in input()
        previous_line = None
        for line in icao_file:
            # Skip a state-title line and the line directly below it
            if valid_line(line, states) and valid_line(previous_line, states):
                process_line(line)
            previous_line = line

The two functions you would have to write are valid_line (which should return a bool) and process_line (which should do whatever you need done with the data).

valid_line should take a list of states along with the current line. It would look something like this:

def valid_line(line, states):
    if not line:
        return True  # if the line is empty or None
    for state in states:
        if line.startswith(state):
            return False  # the line begins with a state name
    return True

process_line is left for you to determine. Make sense?
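As a starting point, here is a rough sketch of what process_line could look like if all you want is the ICAO code. It assumes the file is fixed-width and that the ICAO code sits in character positions 20 to 23 of each station line; check that against your copy of the file before relying on it. The names extract_icao and icao_codes are made up for this example. Slicing by position rather than splitting on spaces matters here because station names contain spaces, and it also means a line with no ICAO value simply produces an empty string instead of shifting the columns.

```python
# Assumption: the ICAO code occupies characters 20-23 of a station line.
ICAO_COLUMNS = slice(20, 24)

icao_codes = []  # hypothetical container for the collected codes


def extract_icao(line):
    """Return the ICAO code on this line, or None if the column is blank."""
    code = line[ICAO_COLUMNS].strip()
    return code or None


def process_line(line):
    """Store the ICAO code, silently skipping lines that have none."""
    code = extract_icao(line)
    if code is not None:
        icao_codes.append(code)


# Two made-up station lines: one with an ICAO code, one without
with_code = "AK " + "ADAK NAS".ljust(17) + "PADK" + "  ADK"
without_code = "AK " + "AKUTAN".ljust(17) + "    " + "  KQA"

process_line(with_code)
process_line(without_code)
```

So a missing ICAO value is not a problem as long as you test for the blank column before storing anything.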

Addendum:

I note in your actual data that state isn't really the thing that determines if a line is 'bad'. You could rewrite valid_line to:

def valid_line(line):
    return (len(line) > 3          # Eliminates short/empty lines
            and line[0] != '!'     # Eliminates 'comment' lines
            and line[2] == ' '     # Eliminates 'state title' lines
            and line[3] != ' ')    # Eliminates 'header column' line

Then your load_icao_data becomes:

def load_icao_data(icao_filename):
    with open(icao_filename, 'r') as icao_file:  # avoid shadowing the built-in input()
        for line in icao_file:
            if valid_line(line):
                process_line(line)
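If you want one large array at the end, one option is to have the loader return the values rather than call process_line for its side effects. The sketch below works on any iterable of lines (an open file included), so it can be tried on a few in-memory lines first. The function name collect_icao, the sample lines, and the 20:24 column slice are all assumptions for illustration, not something taken from your file.

```python
import io


def valid_line(line):
    return (len(line) > 3          # Eliminates short/empty lines
            and line[0] != '!'     # Eliminates 'comment' lines
            and line[2] == ' '     # Eliminates 'state title' lines
            and line[3] != ' ')    # Eliminates 'header column' line


def collect_icao(lines):
    """Return the non-blank ICAO codes from every valid station line."""
    codes = []
    for line in lines:
        if valid_line(line):
            code = line[20:24].strip()  # assumed position of the ICAO column
            if code:
                codes.append(code)
    return codes


# A made-up excerpt laid out like the real file: comment, state title,
# column header, then station lines (the second one has no ICAO code).
sample = io.StringIO("\n".join([
    "!  made-up comment line",
    "ALASKA             19-SEP-14",
    "CD  STATION".ljust(20) + "ICAO  IATA",
    "AK " + "ADAK NAS".ljust(17) + "PADK" + "  ADK",
    "AK " + "AKUTAN".ljust(17) + "    " + "  KQA",
    "AK " + "ANCHORAGE INTL".ljust(17) + "PANC" + "  ANC",
]) + "\n")

codes = collect_icao(sample)
```

With a real file you would pass the open file object instead of sample, and wrap the result in np.array(codes) if you need a NumPy array rather than a list.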

Upvotes: 1
