dogrocket

Reputation: 23

Looking for a strategy for parsing a file

I'm an experienced C programmer, but a complete python newbie. I'm learning python mostly for fun, and as a first exercise want to parse a text file, extracting the meaningful bits from the fluff, and ending up with a tab-delimited string of those bits in a different order.

I've had a blast plowing through tutorials, documentation, and Stack Overflow Q&As, merrily splitting strings, reading lines from files, and so on. Now I think I'm at the point where I need a few road signs from experienced folks to avoid blind alleys.

Here's one chunk of the text I want to parse (you may recognize this as a McMaster order). The actual file will contain one or more chunks like this.

1   92351A603   Lag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5
Your Part Number: 7218-GYROID
22
packs   today
5.85
per pack     128.70

Note that the information is split over several lines in the file. I'd like to end up with a tab-delimited string that looks like this:

22\tpacks\tLag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5\t\t92351A603\t5.85\t\t128.70\t7218-GYROID\n

So I need to extract some parts of the string while ignoring others, rearrange them a bit, and re-pack them into a string.

Here's the (very early) code I have at the moment. It reads the file a line at a time and splits each line on a set of delimiters, so I end up with several lists of strings, including a bunch of empty ones where there were double tabs:

import re
import sys

def split(delimiters, text, maxsplit=0):
    """Split the given text on any of the given delimiters (a sequence of strings).
    This function lifted from stackoverflow in a post by Kos"""
    # Escape each delimiter so it is matched literally, then OR them together.
    pattern = '|'.join(map(re.escape, delimiters))
    return re.split(pattern, text, maxsplit=maxsplit)

delimiters = "\t", "\n", "\r", "Your Part Number: "
with open(sys.argv[1], 'r') as f:
    for line in f:
        print(split( delimiters, line))

f.close()

Question 1 is basic: how can I remove the empty strings from my lists, then mash all the strings together into one list? In C I'd loop through all the lists, ignoring the empties and sticking the other strings in a new list. But I have a feeling python has a more elegant way to do this sort of thing.

Question 2 is more open ended: what's a robust strategy here? Should I read more than one line at a time in the first place? Make a dictionary, allowing easier re-ordering of the items later?

Sorry for the novel. Thanks for any pointers. And please, stylistic comments are more than welcome, style matters.

Upvotes: 2

Views: 573

Answers (2)

mpenkov

Reputation: 21906

You can remove empty strings by:

new_list = filter(None, old_list)

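To keep elements based on some other condition, replace the first parameter with a function (a lambda expression, say) that returns True for the elements you want to keep. Passing None is equivalent to lambda x: x, so only truthy elements survive.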
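As a small illustration of the filter idiom (the list contents here are made up, not taken from the asker's actual output), something like this should work in Python 3:

```python
# Hypothetical split result with empty strings left by double tabs.
parts = ["22", "", "packs", "", "today"]

# filter(None, ...) drops falsy items. In Python 3 it returns a lazy
# iterator, so wrap it in list() to get a list back.
kept = list(filter(None, parts))

# The equivalent list comprehension, which many people find easier to read.
kept_too = [p for p in parts if p]
```

Either form leaves just the non-empty strings; which one to use is mostly a matter of taste.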

You can mash strings together into one string using:

a_string = "".join(list_of_strings)

That will simply concatenate them, but you can use any non-empty string (such as "\t") as the separator instead of "".

If you have several lists (of whatever) and you want to join them together into one list, then:

new_list = reduce(lambda x, y: x+y, old_list)
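To make the joining and flattening concrete, here is a rough sketch. The sample lists are invented, and itertools.chain.from_iterable is an alternative I'm adding, not something from the answer above:

```python
from functools import reduce  # in Python 3, reduce lives in functools
from itertools import chain

# Hypothetical per-line results, already stripped of empty strings.
lists = [["1", "92351A603"], ["22"], ["packs", "today"]]

# Flatten with reduce, as in the answer above.
flat_reduce = reduce(lambda x, y: x + y, lists)

# Flatten with itertools.chain, which avoids building intermediate lists.
flat_chain = list(chain.from_iterable(lists))

# Join the flattened strings with a tab as the separator.
row = "\t".join(flat_chain)
```

Both approaches give the same flat list; chain.from_iterable tends to scale better when there are many sublists.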

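If you're new to Python, then functions like filter and reduce may seem a bit alien, but they save a lot of time coding, so it's worth getting to know them. (EDIT: in Python 3, reduce is no longer a builtin and has to be imported from functools; filter is still a builtin, but it returns an iterator rather than a list.)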

I think you're on the right track to solving your problem. I'd do this:

  • break up everything into lines
  • break the resulting list into smaller lists, one list per order
  • parse the orders into "something meaningful"
  • sort, output the result

Personally, I'd make a class to handle the last two parts (they kind of belong together logically) but you could get by without it.
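The steps above could be sketched roughly as follows. This is only one possible reading, and it rests on assumptions that are not confirmed by the question: fields are tab-separated, every order starts with a line of the form "line number, tab, part number, tab, description", and each chunk has exactly the ten fields seen in the sample. The names SAMPLE, ORDER_START, group_orders and Order are made up for the example, and the sorting step is left out.

```python
import re

# The sample chunk from the question, with tabs assumed between fields.
SAMPLE = (
    '1\t92351A603\tLag Screw for Wood, 18-8 Stainless Steel, '
    '5/16" Diameter, 5" Long, packs of 5\n'
    'Your Part Number: 7218-GYROID\n'
    '22\n'
    'packs\ttoday\n'
    '5.85\n'
    'per pack\t128.70\n'
)

# Assumption: a new order begins with "<line no>\t<part no>\t<description>".
ORDER_START = re.compile(r"^\d+\t\S+\t")

def group_orders(lines):
    """Yield one flat list of non-empty fields per order."""
    fields = []
    for line in lines:
        if ORDER_START.match(line) and fields:
            yield fields
            fields = []
        # Split on tabs and drop the empties left by double tabs.
        fields.extend(f for f in line.rstrip("\r\n").split("\t") if f)
    if fields:
        yield fields

class Order:
    """Hypothetical container for one order; field order is assumed."""

    def __init__(self, fields):
        (self.line_no, self.part_no, self.description, customer_part,
         self.quantity, self.unit, self.ships, self.unit_price,
         _per_unit, self.total) = fields
        self.customer_part = customer_part.replace("Your Part Number: ", "")

    def to_row(self):
        # Column order copied from the desired output in the question.
        columns = [self.quantity, self.unit, self.description, "",
                   self.part_no, self.unit_price, "", self.total,
                   self.customer_part]
        return "\t".join(columns) + "\n"

rows = [Order(f).to_row() for f in group_orders(SAMPLE.splitlines())]
```

If a chunk ever has a different number of fields, the tuple unpacking in Order raises ValueError, which at least makes malformed input obvious instead of silently shifting columns.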

Upvotes: 0

Kabie

Reputation: 10673

You don't need to close the file when using with; the with block closes it for you.

If I were implementing this, I might use one big regex to extract the parts from each chunk (with re.finditer) and then reassemble them for output.
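A minimal sketch of that regex idea, under the same kind of assumptions as the sample suggests but which the question does not confirm: fields are separated by tabs, lines end in "\n", and every chunk has the six lines shown. The pattern and the names CHUNK, SAMPLE and to_row are illustrative guesses, not a tested parser for real McMaster orders.

```python
import re

# The sample chunk from the question, with tabs assumed between fields.
SAMPLE = (
    '1\t92351A603\tLag Screw for Wood, 18-8 Stainless Steel, '
    '5/16" Diameter, 5" Long, packs of 5\n'
    'Your Part Number: 7218-GYROID\n'
    '22\n'
    'packs\ttoday\n'
    '5.85\n'
    'per pack\t128.70\n'
)

# One pattern per chunk; named groups mark the pieces worth keeping.
CHUNK = re.compile(
    r"""
    ^\d+\t+(?P<part_no>\S+)\t+(?P<description>[^\t\n]+)\n
    Your\ Part\ Number:\ (?P<customer_part>\S+)\n
    (?P<quantity>\d+)\n
    (?P<unit>[^\t\n]+)\t+[^\n]*\n
    (?P<unit_price>[\d.]+)\n
    [^\t\n]+\t+(?P<total>[\d.,]+)\n?
    """,
    re.MULTILINE | re.VERBOSE,
)

def to_row(match):
    """Reassemble one match in the column order the question asks for."""
    g = match.groupdict()
    columns = [g["quantity"], g["unit"], g["description"], "",
               g["part_no"], g["unit_price"], "", g["total"],
               g["customer_part"]]
    return "\t".join(columns) + "\n"

# finditer walks the whole text and yields one match per chunk.
rows = [to_row(m) for m in CHUNK.finditer(SAMPLE)]
```

To use it on a real file you would read the whole thing with f.read() and pass that string to CHUNK.finditer. Chunks that don't fit the pattern are skipped silently, so it may be worth comparing the number of matches with the number of orders you expect.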

Upvotes: 1
