Việt Thắng

Reputation: 1

How to extract data using Python

I am parsing lines of text so that I can write them to a CSV file with fixed columns. The data looks like this:

Andrew Taggart 12345678 Math: 90 English: 78 Physics: 85

Jame Bond 1012478 English: 97 Physics: 85 Chemistry: 76

Hope Williams 1478978 Math: 89 English: 85 Physics: 76

and I want the output to look like this:

Name, Student_ID, Math, English, Physics, Chemistry

Andrew Taggart, 12345678, 90, 78, 85, -1

Jame Bond, 1012478, -1, 97, 85, 76

Hope Williams, 1478978, 89, 85, 76, -1

the format will be:

name|subject1|subject2|subject3

name1|23|34|23

name2|3|2|5

Note that the columns are separated by "," and if a student doesn't have a grade for a subject, the value should be -1, as it is for Chemistry.

Here is my code so far.

import re
student = "Andrew Taggart 12345678 Math: 90 English: 78 Physics: 85"

# Split on runs of digits; the text before the first number is the name
name = re.split(r'(\d+)', student)[0].strip()  # extract name
# -> Output: Andrew Taggart
ID = re.split(r'(\d+)', student)[1]  # extract ID
# -> Output: 12345678

headers = ["Math", "English", "Physics", "Chemistry"]
# All numbers in the line, skipping the first one (the ID)
grade = re.findall(r"[-+]?\d*\.\d+|\d+", student)[1:]
# -> Output: ["90", "78", "85"]
# Alternating subject names and grades, with the colons removed
student_grade = [i.strip().replace(":", "") for i in re.split(r"(\d+\.\d+|\d+)", student)][2:-1]
# -> Output: ["Math", "90", "English", "78", "Physics", "85"]

Upvotes: 0

Views: 226

Answers (1)

Capie

Reputation: 996

This does not return the data exactly the way you want, but it may help.

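data = "Andrew Taggart 12345678 Math: 90 English: 78 Physics: 85"
formatted = {}
key = ""  # collects non-numeric words until a number is found
for word in data.split():
    try:
        num = int(word)
        # A number ends the current key: store the pair and start over
        formatted.update({key: num})
        key = ""
    except ValueError:
        # Not a number, so the word belongs to the current key
        if key == "":
            key = word
        else:
            key += f" {word}"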

The output looks like this:

{'Andrew Taggart': 12345678, 'Math:': 90, 'English:': 78, 'Physics:': 85}

You can easily use this format to create a CSV file.
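As a rough sketch of that last step, assuming one dict per student in the shape shown above (the `students` list, the `to_row` helper and the `FIELDS` order are made up for illustration), the standard library's `csv.DictWriter` can fill in -1 for missing subjects through its `restval` argument:

```python
import csv
import io

# Hypothetical sample: dicts in the shape produced by the loop above.
students = [
    {"Andrew Taggart": 12345678, "Math:": 90, "English:": 78, "Physics:": 85},
    {"Jame Bond": 1012478, "English:": 97, "Physics:": 85, "Chemistry:": 76},
]

FIELDS = ["Name", "Student_ID", "Math", "English", "Physics", "Chemistry"]


def to_row(formatted):
    """Turn one parsed dict into a row keyed by the CSV column names."""
    items = list(formatted.items())
    name, student_id = items[0]  # the first pair is name -> student ID
    row = {"Name": name, "Student_ID": student_id}
    for subject, grade in items[1:]:
        row[subject.rstrip(":")] = grade  # "Math:" -> "Math"
    return row


# An in-memory buffer keeps the sketch self-contained;
# use open("students.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval=-1)
writer.writeheader()
writer.writerows(to_row(s) for s in students)
csv_text = buffer.getvalue()
print(csv_text)
```

This relies on dicts keeping insertion order (Python 3.7+), so the name/ID pair is always the first item.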

----EDIT------

result = []
data = """Andrew Taggart 12345678 Math: 90 English: 78 Physics: 85
Jame Bond 1012478 English: 97 Physics: 85 Chemistry: 76
Hope Williams 1478978 Math: 89 English: 85 Physics: 76"""
for line in data.split("\n"):
    student = True  # True until the first number (the student ID) is seen
    form = {}
    key = ""  # collects non-numeric words until a number is found
    for word in line.split():
        try:
            num = int(word)
            if student:
                # The first number closes the name and is the student ID
                form.update({"name": key, "student_id": num})
                student, key = False, ""
                continue
            # Any later number is the grade for the subject held in key
            form.update({key: num})
            key = ""
        except ValueError:
            if key == "":
                key = word
                continue
            key += f" {word}"
    result.append(form)

Then just do:

import pandas as pd
pd.DataFrame(result).fillna(-1)

And you get:

             name  student_id  Math:  English:  Physics:  Chemistry:
0  Andrew Taggart    12345678   90.0        78        85        -1.0
1       Jame Bond     1012478   -1.0        97        85        76.0
2   Hope Williams     1478978   89.0        85        76        -1.0
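The table above still has the trailing colons in the headers, and the columns that had missing values come out as floats (90.0, -1.0), which is not quite the format asked for in the question. A minimal sketch of an alternative without pandas, assuming every line is "name, ID, then Subject: grade pairs" and that the four subjects from the question are the only ones (`SUBJECTS`, `LINE_RE`, `GRADE_RE` and `parse_line` are names made up for this example):

```python
import re

SUBJECTS = ["Math", "English", "Physics", "Chemistry"]
# Name (everything up to the first number), the ID, then the rest of the line.
LINE_RE = re.compile(r"^\s*(?P<name>\D+?)\s+(?P<id>\d+)\s*(?P<rest>.*)$")
# One "Subject: grade" pair.
GRADE_RE = re.compile(r"(\w+):\s*(\d+)")


def parse_line(line):
    """Return [name, id, grades...] with -1 for any subject not listed."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    grades = {subject: int(value) for subject, value in GRADE_RE.findall(match["rest"])}
    return [match["name"], int(match["id"])] + [grades.get(s, -1) for s in SUBJECTS]


data = """Andrew Taggart 12345678 Math: 90 English: 78 Physics: 85
Jame Bond 1012478 English: 97 Physics: 85 Chemistry: 76
Hope Williams 1478978 Math: 89 English: 85 Physics: 76"""

rows = [parse_line(line) for line in data.splitlines()]
print(", ".join(["Name", "Student_ID"] + SUBJECTS))
for row in rows:
    print(", ".join(str(value) for value in row))
```

Because the grades are looked up by subject name, the order in which subjects appear on a line does not matter, and any subject not in `SUBJECTS` is silently ignored.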

Upvotes: 1
