Edgar

Reputation: 79

How to split a csv into multiple csv files using a list of keywords

I am trying to read performance reports from multiple machines, parse them, and combine them so I can easily compare machine performance on single plots. Once the reports are divided into multiple csv files, I plan on reading them with pd.read_csv() and combining multiple tools into single df's.

But in order to do that, I must first deal with and split up rather ugly csv files with semicolon separators.
The structure of the CSV is like this:

KEYWORD_01;;;...;;
COL_01;COL_02;COL03;...;COL_n;
Line_1;
Line_2;
Line_3;
...
Line_m;
KEYWORD_02;;;...;;
COL_01;COL_02;COL03;...;COL_x;
Line_1;
Line_2;
Line_3;
...
Line_y;
KEYWORD_03;;;...;;
COL_01;COL_02;COL03;...;COL_f;
Line_1;
Line_2;
Line_3;
...
Line_g;

Data csv file available here

The csv report is made of multiple sections, each beginning with a fixed keyword (or keyphrase); each section has a set number of columns (which can vary from section to section) and a dynamic number of rows, depending on the number of events reported (cf. the structure above).

  1. I create a list called tpm_sections with all my keywords, plus a placeholder list tpm_idx for the matching row indexes:

    tpm_sections = ['Summary of time consumption',
        'Equipment Indicators',
        'Batch Profile',
        'Jam Profile',
        'Jam Time Profile',
        'Jam Table',
        'Handler Model profile',
        'Miscellaneous Indicators ',
        'Tape Job Profile ']
    tpm_idx = [None]*len(tpm_sections)
    
  2. I read my csv, use regex to extract the machine name, and match each element of my tpm_sections list against the rows of the file; enumerate lets me store the matching row index in a separate list tpm_idx:

import csv
import os
import re
from datetime import datetime

for file in os.listdir(input_folder):
    input_file = os.path.join(input_folder, file)
    if file.endswith('.csv'):
        # get TPM report date from the file creation timestamp
        tpm_date = datetime.fromtimestamp(os.path.getctime(input_file)).strftime('%Y%m%d')
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter=";")
            for i, row in enumerate(reader):
                if 'Machine' in row:
                    # machine name is a UNC-style path like \\7icostNN in the second column
                    mcpat = re.compile(r'\\\\7icost\d\d')
                    mcline = str(row[1])
                    mcname = mcpat.match(mcline).group(0)[2:]
                    mcid = mcname[6:]
                    print('Report date is: ' + tpm_date + '\nMachine Name: ' + mcname + '\nMachine ID: ' + mcid)
                for j in range(len(tpm_sections)):
                    if tpm_sections[j] in row:
                        tpm_idx[j] = i
                        print('Section ' + tpm_sections[j] + ' starts at line: ' + str(tpm_idx[j]))
            # tpm_idx_names: short per-section labels, defined alongside tpm_sections (not shown here)
            tpm_dict = {tpm_idx_names[i]: tpm_idx[i] for i in range(len(tpm_idx))}
  3. I now have a list of keywords, a list of matching row indexes and a dictionary which links both; how should I proceed about splitting the csv file? Below is my code to write one csv file per section of my reader object for future pandas import ([optional] create sub-folders per section for more structure):

    # NOTE: csv.reader is an iterator and cannot be sliced, so the rows are
    # materialised first, e.g. rows = list(reader) inside the loop above
    for j in range(len(tpm_idx_names)):
        output_file = tpm_date + mcname + tpm_idx_names[j]
        with open(output_file, 'w', newline='') as o:
            if j + 1 < len(tpm_idx):
                for line in rows[tpm_idx[j]:tpm_idx[j+1]]:
                    o.write(';'.join(line) + '\n')
            else:
                for line in rows[tpm_idx[j]:]:
                    o.write(';'.join(line) + '\n')
    
  4. Is there a simpler method of doing this by passing a list of keywords to the split() function? That would be awesome, but I couldn't find any example of this being possible. Or could I make better use of regex and then a while "line is not empty" loop? Bear in mind that an empty line in my csv is made of ;;;;;
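
For illustration, a rough, untested sketch of the kind of keyword-based split I have in mind, using one alternation pattern built from tpm_sections (the names here are just placeholders):

import re

# rough sketch: split the whole report text at keyword lines, keeping each
# matched keyword thanks to the capturing group
with open(input_file) as f:
    text = f.read()

pattern = '^(' + '|'.join(re.escape(k) for k in tpm_sections) + ')'
chunks = re.split(pattern, text, flags=re.MULTILINE)
# chunks = [preamble, keyword_1, body_1, keyword_2, body_2, ...]
sections = dict(zip(chunks[1::2], chunks[2::2]))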

  5. Should I instead create a list of lists (LoL) using ls.append(), or a numpy array, for every line in between matched tpm_sections[j] keywords? Then I could easily add columns for my machine name and ID. I could either create a single LoL/array appending all of my 20 machines, or create one per machine and append them later, either in pandas or before writing my csv. Code example to add in part 2:

elif tpm_sections[j] in row: TPM_LoL[j].append(row)
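
Roughly, the whole loop I picture for this (untested sketch; rows would be the materialised reader, i.e. rows = list(reader)):

TPM_LoL = [[] for _ in tpm_sections]     # one list of rows per section
current = None
for row in rows:
    for j, section in enumerate(tpm_sections):
        if section in row:
            current = j                  # a keyword row starts a new section
            break
    else:
        if current is not None:
            # tag each data row with the machine name and ID columns
            TPM_LoL[current].append(row + [mcname, mcid])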

Upvotes: 0

Views: 755

Answers (1)

Grismar

Reputation: 31329

I think you're making things harder by breaking the problem up into too many smaller problems. Extracting only the data you need straight from the original html (which is a structured data format of sorts as well) would probably have been easiest.

However, if you're looking for a way to:

  • split up an existing text file into multiple text files
  • split just before a keyword line
  • only write output for selected keywords

And assuming the text file is a semi-colon-separated file for which any line that only has a single term in the first column is a keyword line, then this should work:

tpm_sections = [
    'Summary of time consumption',
    'Equipment Indicators',
    'Batch Profile',
    'Jam Profile',
    'Jam Time Profile',
    'Jam Table',
    'Handler Model profile',
    'Miscellaneous Indicators ',
    'Tape Job Profile '
]
out_f = None
with open('ICOST_19_TPM_20201124.csv') as f:
    for line in f:
        parts = line.strip().split(';')
        if parts[0] and (parts[1:].count('') == len(parts) - 1):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            if parts[0] in tpm_sections:
                # naming the new file after the section
                out_f = open(f'{parts[0]}.csv', 'w')
            else:
                out_f = None
        # for any line, if an output file is open at this point, write to it
        if out_f is not None:
            out_f.write(line)
    else:
        if out_f is not None:
            out_f.close()

If you don't want to recognise each line with only one value in the first column as a keyword line, but only want lines that have a recognised keyword to cause a split (and include everything after it in that file), you can simply change this:

        if parts[0] and (parts[1:].count('') == len(parts) - 1):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            if parts[0] in tpm_sections:
                # naming the new file after the section
                out_f = open(f'{parts[0]}.csv', 'w')
            else:
                out_f = None

To:

        if (parts[0] and (parts[1:].count('') == len(parts) - 1) and
                (parts[0] in tpm_sections)):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            out_f = open(f'{parts[0]}.csv', 'w')

But it's not entirely clear from the question or the data which it should be. Either does as advertised.
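
Once the files are split, something along these lines (an untested sketch, assuming the per-section files are collected in a folder of their own, here called split_output) would get you back to the pandas side of your plan:

import pandas as pd
from pathlib import Path

frames = []
for path in Path('split_output').glob('*.csv'):
    # first line of each split file is the keyword line, second line the column headers
    df = pd.read_csv(path, sep=';', skiprows=1)
    df['section'] = path.stem            # remember which section the rows came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)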

Upvotes: 1
