How can I clean this data for easier visualizing?

Question

I'm writing a program that reads rows of data and counts matching entries. I have the code below, but I'd like to cut or filter out the trailing numbers that keep otherwise identical entries from being recognized as a match.

import collections

a = "test.txt" #This can be changed to a = input("What's the filename? ", )
line_file = open(a, "r")
print(line_file.readable()) #Readable check.
#print(line_file.read()) #Prints each individual line.

#Code for quantity counter.
counts = collections.Counter() #Creates a new counter.
with open(a) as infile:
    for line in infile:
        for number in line.split():
            counts.update((number,))
for key, count in counts.items():
    print(f"{key}: x{count}")

line_file.close()

This is what it outputs. However, I'd like it to ignore the numbers at the end and group the matching entries accordingly.

A2-W-FF-DIN-22: x1
A2-FF-DIN: x1
A2-W-FF-DIN-11: x1
B12-H-BB-DD: x2
B12-H-BB-DD-77: x1
C1-GH-KK-LOP: x1

What I'm aiming for is for it to ignore the "-77" here and instead count the total as x3:

B12-H-BB-DD: x2 
B12-H-BB-DD-77: x1

Erick Shepherd · Accepted Answer

You could use a regular expression to create a matching group for a digit suffix. If each number is its own string, e.g. "A2-W-FF-DIN-11", then a regular expression like (?P<base>.+?)(?:-(?P<suffix>\d+))?\Z could work.

Here, (?P<base>.+?) is a non-greedy match of one or more characters other than a newline, grouped under the name "base"; (?:-(?P<suffix>\d+))? matches 0 or 1 occurrences of something like -11 at the end of the "base" group and puts the digits in a group named "suffix"; and \Z is the end of the string.

This is what it does in action:

>>> import re
>>> regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")
>>> regex.match("A2-W-FF-DIN-11").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': '11'}
>>> regex.match("A2-W-FF-DIN").groupdict()
{'base': 'A2-W-FF-DIN', 'suffix': None}

As you can see, the base is the same whether or not the string has a digit suffix.
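
To make that behaviour easier to reuse and check, the same pattern could be wrapped in a small helper. This is only a sketch: the names `BASE_SUFFIX` and `strip_digit_suffix` are hypothetical, and it assumes each token looks like the samples in the question.

```python
import re

# Same pattern as above: a lazy "base" followed by an optional "-<digits>" suffix.
BASE_SUFFIX = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")


def strip_digit_suffix(token):
    """Return token without a trailing "-<digits>" part, if it has one.

    Hypothetical helper, not part of the original code.
    """
    match = BASE_SUFFIX.match(token)
    # The pattern needs at least one character, so an empty string
    # does not match; in that case the token is returned unchanged.
    return match["base"] if match else token


print(strip_digit_suffix("B12-H-BB-DD-77"))  # B12-H-BB-DD
print(strip_digit_suffix("B12-H-BB-DD"))     # B12-H-BB-DD
print(strip_digit_suffix("C1-GH-KK-LOP"))    # C1-GH-KK-LOP
```

Note that any trailing all-digit segment is treated as a suffix, so an identifier whose last real segment is numeric would also be shortened.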

All together, here's a self-contained example of how it might be applied to data like this:

import collections
import re

regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

sample_data = [
    "A2-FF-DIN",
    "A2-W-FF-DIN-11",
    "A2-W-FF-DIN-22",
    "B12-H-BB-DD",
    "B12-H-BB-DD",
    "B12-H-BB-DD-77",
    "C1-GH-KK-LOP"
]

counts = collections.Counter()

# Iterates through the data and updates the counter.
for datum in sample_data:
    
    # Isolates the base of the number from any digit suffix.
    number = regex.match(datum)["base"]
    
    counts.update((number,))

# Prints each number and how many instances were found.
for key, count in counts.items():
    
    print(f"{key}: x{count}")

For which the output is

A2-FF-DIN: x1
A2-W-FF-DIN: x2
B12-H-BB-DD: x3
C1-GH-KK-LOP: x1

Or, applied to the example code you provided, it might look like this:

import collections
import re

# Compiles a regular expression to match the base and suffix
# of a number in the file.
regex = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")

a = "test.txt"
line_file = open(a, "r")
print(line_file.readable())  # Readable check.

# Creates a new counter.
counts = collections.Counter()

with open(a) as infile:

    for line in infile:

        for number in line.split():

            # Isolates the base match of the number.
            counts.update((regex.match(number)["base"],))

for key, count in counts.items():

    print(f"{key}: x{count}")

line_file.close()
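
If the goal is output that is easier to scan, one possible variation is to put the counting in a function and print the totals from most to least frequent with `Counter.most_common()`. The sketch below uses an inline list of lines as a stand-in for the contents of test.txt (an assumption, since the actual file is not shown), and the name `count_bases` is hypothetical.

```python
import collections
import re

BASE_SUFFIX = re.compile(r"(?P<base>.+?)(?:-(?P<suffix>\d+))?\Z")


def count_bases(lines):
    """Count whitespace-separated tokens, ignoring a trailing "-<digits>"."""
    counts = collections.Counter()
    for line in lines:
        # split() never yields empty tokens, so the pattern always matches.
        for token in line.split():
            counts[BASE_SUFFIX.match(token)["base"]] += 1
    return counts


# Inline sample standing in for the lines of test.txt (assumed layout).
sample_lines = [
    "A2-W-FF-DIN-22 A2-FF-DIN A2-W-FF-DIN-11\n",
    "B12-H-BB-DD B12-H-BB-DD B12-H-BB-DD-77\n",
    "C1-GH-KK-LOP\n",
]

counts = count_bases(sample_lines)

# most_common() yields (key, count) pairs, highest count first.
for key, count in counts.most_common():
    print(f"{key}: x{count}")
```

Because `count_bases` accepts any iterable of lines, an open file could be passed in directly, e.g. inside a single `with open("test.txt") as infile:` block, which would also remove the need for the separate `open`/`close` pair.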
