Reputation: 99

Iterating through multiple files, extract data to another file

I have a source directory that contains multiple folders. Each folder has a file named tvshow.nfo from which I want to extract data. I wrote the following:

import sys
import os
import re
from pathlib import Path

L = []
my_dir = "./source/"
for item in Path(my_dir).glob('./*/tvshow.nfo'):
    M = str(item).splitlines()
    for i in M:
        f = open(i, "r")
        for i in f:
            for j in re.findall("<title>(.+)</title>", i):
                L.append(j)
            for j in re.findall("<year>(.+)</year>", i):
                L.append(j)
            for j in re.findall("<status>(.+)</status>", i):
                L.append(j)
            for j in re.findall("<studio>(.+)</studio>", i):
                L.append(j)
        for i in L:
            print (i)
        f.close()

I used glob to get the exact paths of all the nfo files, used splitlines to separate each path, iterated through the file at each of those paths, and then used regex to extract the info, which I tried to append to the empty list. I get the following output:

APB
2017
Continuing
FOX (US)
APB
2017
Continuing
FOX (US)
Angie Tribeca
2016
Continuing
TBS
APB
2017
Continuing
FOX (US)
Angie Tribeca
2016
Continuing
TBS
Arrow
2012
Continuing
The CW
['APB', '2017', 'Continuing', 'FOX (US)', 'Angie Tribeca', '2016', 'Continuing', 'TBS', 'Arrow', '2012', 'Continuing', 'The CW']

I want the output exported to a new file as:

APB 2017 Continuing FOX (US)
Angie Tribeca 2016 Continuing TBS
Arrow 2012 Continuing The CW

Can anyone help me? Also is there a better way to do this than the one I attempted?

Upvotes: 0

Answers (3)

Josh Rhoads

Reputation: 11

From your example it looks like all of the tags for each show are on one line. If that is the case, I think something like this might help:

import sys
import os
import re
from pathlib import Path


def find_tag(tag, l):
    ''' returns result of findall on a tag on line l'''
    full_tag = "<" + tag + ">(.+)</" + tag + ">"
    return re.findall(full_tag, l)


L = []
my_dir = "./source/"
for item in Path(my_dir).glob('./*/tvshow.nfo'):
    # changed the file variable to data_file
    M = str(item).splitlines()
    for data_file in M:
        # use with to open the file without needing to close it
        with open(data_file, "r") as f:

            for line in f:
                title = find_tag("title", line)
                year = find_tag("year", line)
                status = find_tag("status", line)
                studio = find_tag("studio", line)
                fields = [str(d[0]) for d in [title, year, status, studio] if d]
                # skip lines that contain none of the tags
                if fields:
                    L.append(' '.join(fields))

# print the data or whatever else you're doing with it
for data in L:
    print(data)

This uses with to open the file, so you don't need a try/finally block to close it yourself. More information about with can be found in the Python documentation on file methods.
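As a rough illustration of what with saves you, the two functions below should behave the same way. The helper names and the temporary sample file are made up for the example and are not part of your code.

```python
import os
import tempfile


def read_with(path):
    # the file is closed automatically when the block exits, even on error
    with open(path, "r") as f:
        return f.read()


def read_try_finally(path):
    # roughly what the with statement does for you
    f = open(path, "r")
    try:
        return f.read()
    finally:
        f.close()


# hypothetical sample file, only for demonstration
tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "tvshow.nfo")
with open(sample, "w") as f:
    f.write("<title>APB</title>\n")

print(read_with(sample) == read_try_finally(sample))
```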

d[0] takes the first match out of the list that re.findall returns; that match is already a string, so the str() call is only a safeguard. The if d is there in case a tag is missing on that line (and it's possible I'm misunderstanding how the tags are placed within the file, sorry about that if I am).

It's also possible to build L with a list comprehension: L = [(find_tag("title", line), find_tag("year", line), find_tag("status", line), find_tag("studio", line)) for line in f] instead of appending to the list.

The join method could then be used when printing the list: print(' '.join(str(d[0]) for d in data if d)).

Whether or not you want to do that depends on how much you like list comprehensions.
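Here is a small sketch of the comprehension version, run on a made-up list of lines instead of a real file. The TAGS tuple and the sample lines are assumptions for the example; they assume every show has all of its tags on one line.

```python
import re


def find_tag(tag, l):
    ''' returns result of findall on a tag on line l'''
    full_tag = "<" + tag + ">(.+)</" + tag + ">"
    return re.findall(full_tag, l)


TAGS = ("title", "year", "status", "studio")

# hypothetical input: one line per show, all tags on that line
lines = [
    "<title>APB</title><year>2017</year><status>Continuing</status><studio>FOX (US)</studio>",
    "<title>Arrow</title><year>2012</year><status>Continuing</status>",
]

# one inner list of findall results per line
rows = [[find_tag(tag, line) for tag in TAGS] for line in lines]

for data in rows:
    # missing tags give an empty list, which "if d" filters out
    print(' '.join(str(d[0]) for d in data if d))
```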

I also created a find_tag function, but that's mostly from me trying to figure out what was going on.

Without knowing what the file looks like it's hard to tell if you should be looking for each one on a separate line. It's also hard to tell if the order matters or if you need to do any error handling.
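If it turns out that each tag sits on its own line, one option is to search the whole file text once per tag instead of going line by line. This is only a sketch: extract_show and the sample text are made up, and I'm assuming each tag appears at most once per file.

```python
import re

TAGS = ("title", "year", "status", "studio")


def extract_show(text):
    # collect the first match for each tag, in a fixed order
    fields = []
    for tag in TAGS:
        match = re.search("<" + tag + ">(.+?)</" + tag + ">", text)
        if match:
            fields.append(match.group(1))
    return ' '.join(fields)


# hypothetical file content with one tag per line
sample = """<tvshow>
    <title>Arrow</title>
    <year>2012</year>
    <status>Continuing</status>
    <studio>The CW</studio>
</tvshow>"""

print(extract_show(sample))
```

With a real file you would pass in the result of f.read() instead of the sample string.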

Upvotes: 0

Simon Bilsky-Rollins

Reputation: 545

Instead of making one list with all of the different attributes for each show, you should structure your data in a more easily readable way. One possibility is a list of lists, where the top-level list has one entry for each show and the inner lists contain the title, year, status, and studio attributes for one show. You can modify your existing code quite easily to accomplish this:

    show_attributes = []  # one list per file, so one entry per show
    for i in f:
        for j in re.findall("<title>(.+)</title>", i):
            show_attributes.append(j)
        for j in re.findall("<year>(.+)</year>", i):
            show_attributes.append(j)
        for j in re.findall("<status>(.+)</status>", i):
            show_attributes.append(j)
        for j in re.findall("<studio>(.+)</studio>", i):
            show_attributes.append(j)
    f.close()
    L.append(show_attributes)
    # print only the show that was just read; looping over all of L here
    # would repeat the earlier shows once per file
    print(' '.join(show_attributes))
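To show why the list of lists is handy, here is a small sketch that uses the values from your output as hard-coded sample data. The variable names are my own.

```python
# sample data taken from the output in the question
shows = [
    ["APB", "2017", "Continuing", "FOX (US)"],
    ["Angie Tribeca", "2016", "Continuing", "TBS"],
    ["Arrow", "2012", "Continuing", "The CW"],
]

# each inner list becomes one output line
lines = [' '.join(show) for show in shows]
print('\n'.join(lines))

# individual fields stay easy to reach, e.g. all titles
titles = [show[0] for show in shows]
print(titles)
```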

Upvotes: 0

Shiping

Reputation: 1337

Based on what you showed, you may try this.

import sys
import os
import re
from pathlib import Path

info = []
my_dir = "./source/"
for item in Path(my_dir).glob('./*/tvshow.nfo'):
    # start a fresh list for every file so each show gets its own line
    L = []
    with open(item, "r") as f:
        for line in f:
            for j in re.findall("<title>(.+)</title>", line):
                L.append(j)
            for j in re.findall("<year>(.+)</year>", line):
                L.append(j)
            for j in re.findall("<status>(.+)</status>", line):
                L.append(j)
            for j in re.findall("<studio>(.+)</studio>", line):
                L.append(j)
    info.append(' '.join(L))
# write one show per line to the output file
with open("new_file", "w") as w:
    for i in info:
        w.write(i + "\n")
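As for a better way: if the nfo files are well-formed XML with the four tags as direct children of the root element (an assumption, I haven't seen your files), you could try an XML parser instead of regex. A minimal sketch using xml.etree.ElementTree from the standard library; show_line, export, the shows.txt name, and the demo file are all made up for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

TAGS = ("title", "year", "status", "studio")


def show_line(nfo_path):
    # assumes the tags are direct children of the root element
    root = ET.parse(str(nfo_path)).getroot()
    values = (root.findtext(tag) for tag in TAGS)
    # findtext returns None for a missing tag, so those are skipped
    return ' '.join(v.strip() for v in values if v)


def export(source_dir, out_path):
    nfo_files = sorted(Path(source_dir).glob('*/tvshow.nfo'))
    with open(out_path, "w") as out:
        for nfo in nfo_files:
            out.write(show_line(nfo) + "\n")


# demo with a hypothetical folder layout in a temporary directory
source = Path(tempfile.mkdtemp())
(source / "Arrow").mkdir()
(source / "Arrow" / "tvshow.nfo").write_text(
    "<tvshow><title>Arrow</title><year>2012</year>"
    "<status>Continuing</status><studio>The CW</studio></tvshow>"
)
out_file = source / "shows.txt"
export(source, out_file)
print(out_file.read_text())
```

If a file is not valid XML, ET.parse raises ParseError, so the regex version may still be the safer choice for messy files.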

Upvotes: 1
