Python - How to parse smartctl program output?

Question

I am writing a wrapper for smartctl in Python 2.7.3.

I am having a hell of a time trying to wrap my head around how to parse the output from the smartctl program in Linux (Ubuntu x64, to be specific).

I am running smartctl -l selftest /dev/sdx via subprocess and capturing the output in a variable.

This variable is broken up into a list, then I drop the useless header data and blank lines from the output.

Now, I am left with a list of strings, which is great!

The data is sort-of tabular, and I want to parse it into a dict() full of lists (from reading the docs, I think this is the correct way to represent tabular data in Python).

Here's a sample of the data:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     44796         -
# 2  Short offline       Completed without error       00%     44796         -
# 3  Short offline       Completed without error       00%     44796         -
# 4  Short offline       Completed without error       00%     44796         -
# 5  Short offline       Completed without error       00%     44796         -
# 6  Extended offline    Completed without error       00%     44771         -
# 7  Short offline       Completed without error       00%     44771         -
# 8  Short offline       Completed without error       00%     44741         -
# 9  Short offline       Completed without error       00%         1         -
#10  Short offline       Self-test routine in progress 70%     44813         -

I can see some issues with trying to parse this, and am open to solutions, but I may also just be doing this all wrong ;-):

The Status text "Self-test routine in progress" extends past the first character of the "Remaining" header.
In the Num column, numbers above 9 are not separated from the # character by a space.

I might be way off-base here, but this is my first time trying to parse something this eccentric.

Thanks in advance to everyone who even bothers to read this wall of text!

Here's my code so far, if anyone feels it necessary or finds it useful:

#testStatus.py

#This module provides an interface for retrieving
#test status and results for ongoing and completed
#drive tests

import subprocess

#this function takes a list of strings and removes
#strings which do not have pertinent information
def cleanOutput(data):

  cleanedOutput = []

  del data[0:3] #This deletes records 0-2 (lines 1-3) from the list

  for item in data:
    if item == '': #This removes blank items from remaining list
      pass
    else:
      cleanedOutput.append(item)

  return cleanedOutput


def resultsOutput(data):

  headerLines = []
  resultsLines = []
  resultsTable = {}

  for line in data:
    if "START OF READ" in line or "log structure revision" in line:
      headerLines.append(line)
    else:
      resultsLines.append(line)

  nameLine = resultsLines[0].split()
  print nameLine


def getStatus(sdxPath):
  try:
    output = subprocess.check_output(["smartctl", "-l", "selftest", sdxPath])

  except subprocess.CalledProcessError:
    print ("smartctl command failed...")
    return #output is undefined here, so stop instead of failing below

  except Exception as e:
    print (e)
    return

  splitOutput = output.split('\n')

  cleanedOutput = cleanOutput(splitOutput)

  resultsOutput(cleanedOutput)


#For Testing
getStatus("/dev/sdb")

mgkrebbs · Accepted Answer

The main parsing problem seems to be the first three columns; the remaining data is more straightforward. Assuming the output uses blanks between fields (instead of tab characters, which would be much easier to parse), I'd go for fixed-length parsing, something like:

num = line[1:3]
desc = line[5:25]
status = line[25:54]
remain = line[54:58]
lifetime = line[60:68]
lba = line[77:99]

The header line would be handled differently. What structure you put the data into depends on what you want to do with it. A dictionary keyed by "num" might be appropriate if you mainly wanted to randomly access data by that "num" identifier. Otherwise a list might be better. Each entry (per line) could be a tuple, a list, a dictionary, a class instance, or maybe other things. If you want to access fields by name, then a dictionary or class instance per entry might be appropriate.
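As a rough illustration of the fixed-length approach, here is a minimal sketch that builds one dictionary per line and then indexes those dictionaries by "num". The names FIELDS, parse_selftest_line and parse_selftest_log are made up for this example, and the column offsets are assumptions read off the sample output in the question, so they may need adjusting for other smartctl versions. The num field is taken as line[1:3] so that two-digit entries such as "#10" are captured whole.

```python
# Column offsets are assumptions taken from the sample output in the
# question; verify them against the output of your own smartctl version.
FIELDS = [
    ("num", 1, 3),
    ("description", 5, 25),
    ("status", 25, 54),
    ("remaining", 54, 58),
    ("lifetime_hours", 60, 68),
    ("lba_of_first_error", 77, 99),
]


def parse_selftest_line(line):
    """Slice one '#'-prefixed result row into a dict of stripped strings."""
    return dict((name, line[start:end].strip()) for name, start, end in FIELDS)


def parse_selftest_log(lines):
    """Return a list of row dicts, skipping the header and any other lines."""
    return [parse_selftest_line(line) for line in lines if line.startswith("#")]


# Three lines copied from the sample in the question.
sample = [
    "Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error",
    "# 9  Short offline       Completed without error       00%         1         -",
    "#10  Short offline       Self-test routine in progress 70%     44813         -",
]

rows = parse_selftest_log(sample)

# Optional: index the rows by test number for random access.
by_num = dict((int(row["num"]), row) for row in rows)
print(by_num[10]["status"])
```

Every value stays a string here; converting lifetime_hours to int or stripping the "%" from remaining is left to the caller. Keeping rows as a list preserves the log order, while by_num suits lookups by test number.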
