How to extract contents of multiple text files into a pandas dataframe using Python?

Question

I have two text files with contents like the following:


/*foo1.txt*/

Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVMi
Number of bins (b): 10
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
     #Atts. considered as irrelevant: 0
     Data size: 900; Query size: 100
     Dimensionality of the space: 230
     ... using Test SVM for SVM ...
     ... Equal Frequency discretisation (b=10) ...
     Max. num. of bins: 10, Min. num. of bins: 10
SVM Classification accuarcy scores (C=0.1): 0.5300
SVM Classification accuarcy scores (C=0.5): 0.6300
SVM Classification accuarcy scores (C=10): 0.7300
SVM Classification accuarcy scores (C=100): 0.7300
Done!
Total runtime: 6.8169 second.

/*foo2.txt*/

Number of data records: 1000
Number of attributes: 231
Class attribute index: 231
Monotonic Transformation: None
Number of class labels: 10
Number of folds: 10
Test fold: 1
Random seed: 0
(Dis)similarity measure: Test_SVM
Task: SVM
Number of bins (b): 30
Histogram type: EF
Number of trees (T): 0 (For tree-based methods.)
Sample size (W): 0 (For tree-based methods.)
Running Experiment... Please wait...
     #Atts. considered as irrelevant: 0
     Data size: 900; Query size: 100
     Dimensionality of the space: 230
     ... using Test SVM for SVM ...
     ... Equal Frequency discretisation (b=30) ...
     Max. num. of bins: 30, Min. num. of bins: 30
SVM Classification accuarcy scores (C=0.1): 0.6600
SVM Classification accuarcy scores (C=0.5): 0.7400
SVM Classification accuarcy scores (C=10): 0.8000
SVM Classification accuarcy scores (C=100): 0.8000
Done!
Total runtime: 8.2947 second.

The goal is to load the contents of the two text files (foo1.txt and foo2.txt) into a single pandas DataFrame df.

How can I extract the values from these files into such a DataFrame?

EDIT: The text in the actual .txt files has a different structure, so I have edited the question to reflect the real data.

Shubham Sharma · Accepted Answer

Update (based on the text files you shared in the comments and as per the discussion)

Use one regular expression to extract the relevant sections from each file's text, then use a second pattern to find all column-value pairs and map them into a dictionary, which becomes one record per file. Note: I assumed data is the folder that contains the text files; replace it with your actual folder.

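import re
from pathlib import Path

import pandas as pd

def read_files():
    # 'data' is the folder that holds the .txt files
    for file in Path('data').glob('*.txt'):
        data = file.read_text()
        # group(1): header before "Running Exp"; group(2): SVM score lines up to "Done!"
        m = re.search(r'(.*?)Running Exp.*?(?=SVM Class)(.*?)Done!', data, re.DOTALL)
        if m is None:
            continue  # skip files that do not match the expected layout
        # "name: value" pairs; anything from a "(" after the value is dropped
        c = re.findall(r'^(.*?)\s*:\s*(.*?)\s*(?:\(|$)', m.group(1), re.MULTILINE)
        yield {**dict(c), 'Results': m.group(2).strip()}

df = pd.DataFrame(read_files())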

  Number of data records Number of attributes Class attribute index Monotonic Transformation Number of class labels Number of folds Test fold Random seed  Task Number of bins (b) Histogram type Number of trees (T) Sample size (W)                                                                                                                                                                                                        Results
0                   1000                  231                   231                     None                     10              10         1           0   SVM                 30             EF                   0               0  SVM Classification accuarcy scores (C=0.1): 0.6600\nSVM Classification accuarcy scores (C=0.5): 0.7400\nSVM Classification accuarcy scores (C=10): 0.8000\nSVM Classification accuarcy scores (C=100): 0.8000
1                   1000                  231                   231                     None                     10              10         1           0  SVMi                 10             EF                   0               0  SVM Classification accuarcy scores (C=0.1): 0.5300\nSVM Classification accuarcy scores (C=0.5): 0.6300\nSVM Classification accuarcy scores (C=10): 0.7300\nSVM Classification accuarcy scores (C=100): 0.7300
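
The Results column above still holds the four score lines as a single string. If separate numeric columns are wanted, one option is a further regex pass over that column. The sketch below is not part of the original answer: SCORE_RE, split_scores, and the C=... column names are hypothetical, and the small df is rebuilt by hand from the output above (only the Task and Results columns) so the snippet runs on its own.

```python
import re

import pandas as pd

# Results strings copied from the output above; the "accuarcy" typo is in the log files.
df = pd.DataFrame({
    'Task': ['SVM', 'SVMi'],
    'Results': [
        'SVM Classification accuarcy scores (C=0.1): 0.6600\n'
        'SVM Classification accuarcy scores (C=0.5): 0.7400\n'
        'SVM Classification accuarcy scores (C=10): 0.8000\n'
        'SVM Classification accuarcy scores (C=100): 0.8000',
        'SVM Classification accuarcy scores (C=0.1): 0.5300\n'
        'SVM Classification accuarcy scores (C=0.5): 0.6300\n'
        'SVM Classification accuarcy scores (C=10): 0.7300\n'
        'SVM Classification accuarcy scores (C=100): 0.7300',
    ],
})

# Hypothetical pattern: captures the C value and the score on each line.
SCORE_RE = re.compile(r'\(C=([\d.]+)\):\s*([\d.]+)')

def split_scores(results):
    """Map one Results string to a dict such as {'C=0.1': 0.66, ...}."""
    return {f'C={c}': float(score) for c, score in SCORE_RE.findall(results)}

# One dict per row becomes one row of score columns, aligned on the original index.
scores = pd.DataFrame(df['Results'].map(split_scores).tolist(), index=df.index)
wide = df.drop(columns='Results').join(scores)
print(wide)
```

With the real df from read_files(), the same last three lines should apply unchanged, provided every file reports its scores in the "(C=...): value" form shown here.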
