Reputation: 35
How can I match a string from a column in a .csv
file against a list of defined datatypes?
The list of data types is:
Datatype = ["M", "B", "E", "A", "DF", "DW", "DL", "DT", "XTEXT", "MDEDATA"]
I want to use the data type that matches the column to look up a value in a dictionary.
When my code finds something like DW1
in the input (see below for examples) it writes DW:5
to the output file, and when it finds DT34
it writes DT:7
. This works fine.
But the problem is, whenever my code comes across XTEXT
it matches it first to E: 2
(because E
is present in XTEXT
) and then to XTEXT: 4105
in the next row. As a result, the corresponding line in the output file contains 2
instead of 4105
. Something similar happens for MDEDATA
which is resolved to A:3
.
I tried using a regex to remove numbers from the relevant columns in the input file (this line is commented out in the code above), like this:
if (Datatype[n] == re.sub('[1234567890\n]', '', line.split(";")[1])):
# ...
but this causes problems when encountering strings like X256DATA
.
Another issue is that the output file that is produced currently displays abc;DW1
in the first column and 5
in the second column, but I want abc
in the first column, DW1
in the second and 5
in the third column!
How can I solve these problems? Any help appreciated.
Input:
1 abc DW1
2 uz5 XTEXT
Current output:
1 abc DW1 5
2 uz5 XTEXT 4105
Input:
abc;DW1\n
uz5;XTEXT\n
Current output:
abc;DW1,5\n
uz5;XTEXT,4105\n
abc;DW1;5\n
uz5;XTEXT;4105\n
Upvotes: 1
Views: 2431
Reputation: 5280
OK, here is a solution based on your code that should work for the format of your .txt
files:
import csv
# Define data_type / dict1 / dict2:
data_type = [
"M", "E", "B", "A", "DF", "DW", "DL", "DT", "PF", "PW", "PL", "PT",
"XTEXT", "MDEDATA", "X8DATA", "X16DATA", "X32DATA", "X64DATA",
"X128DATA", "X256DATA", "X512DATA", "X1024DATA", "X2048DATA",
"X4096DATA", "X8912DATA"]
dict1 = {
"M": 1, "E": 2, "A": 3, "DF": 4, "DW": 5, "DL": 6, "DT": 7, "PF": 8,
"PW": 9, "B": 10, "PL": 11, "PT": 12, "XTEXT": 4105, "MDEDATA": 4110,
"X8DATA": 10000, "X16DATA": 10001, "X32DATA": 10002, "X64DATA": 10003,
"X128DATA": 10004, "X256DATA": 10005, "X512DATA": 10006,
"X1024DATA": 10007, "X2048DATA": 10008, "X4096DATA": 10009,
"X8912DATA": 10010}
dict2 = {
"M": "B", "E": "B", "A": "B", "DF": ">f","PF": ">f", "DW": ">h",
"PW": ">h", "DL": ">l", "PL": ">l", "DT": "String", "PT": "String",
"B": "B", "XTEXT": "XTEXT", "MDEDATA": "MDEDATA", "X8DATA": "X8DATA",
"X16DATA": "X16DATA", "X32DATA": "X32DATA", "X64DATA": "X64DATA",
"X128DATA": "X128DATA", "X256DATA": "X256DATA", "X512DATA": "X512DATA",
"X1024DATA": "X1024DATA", "X2048DATA": "X2048DATA",
"X4096DATA": "X4096DATA", "X8912DATA": "X8912DATA"}
# Obtain writer for output file:
source = 'GSV.txt'
dest = open('GSV_copy.txt', 'w')
# Process data:
import re
source_lines = open(source).readlines()
for line in source_lines:
x = line.split(';')[1]
icon_lletter = ''
icon_lnummer = 0
python_letter = ''
for t in data_type:
if x.startswith(t): # Alternative using re: "if re.match(t, x)"
icon_lletter = t
break
if (icon_lletter not in dict1):
print "Entry (%s) in Dictionary1 is not available" % (icon_lletter)
else:
icon_lnummer = dict1[icon_lletter]
if (icon_lletter not in dict2):
print "Entry (%s) in Dictionary2 is not available" % (icon_lletter)
else:
python_letter = dict2[icon_lletter]
print "Converted the GVR datatype to python datatype is %s : %s : %s" \
% (icon_lletter, python_letter, icon_lnummer)
# Write line to output file:
dest.write(';'.join([line.strip('\n'), str(icon_lnummer) + '\n']))
dest.close()
Please note:
With this solution I am trying to fix your problem while staying close to your original implementation. There are different ways of achieving your goal that are probably more Pythonic / efficient / DRY.
Since your input and ouput files do not have the same number of columns and differ in terms of formatting, you will eventually want to turn the code into a (set of) function(s) that take the things that differ (file type and column to check) as parameters.
Edit:
Working on this some more I noticed that the output in dest
was a bit garbled. (I hadn't gotten that far before :)) I've changed the code to produce the exact output you want now.
Another Edit (based on discussion in comments):
On the topic of matching data types in the input file(s) against the list of defined data_type
s: If the DT
s in your input file always have the numbers that you don't want to consider at the beginning or at the end, you could use
re.sub('^[0-9]+', '', x)
or
re.sub('[0-9]+$', '', x)
to remove them.
Upvotes: 1
Reputation: 35
As suggested by itsjeyd,
re.sub('^[0-9]+', '', x) or re.sub('[0-9]+$', '', x) does the trick!
Upvotes: 0
Reputation: 25207
Normal dictionary lookup matches the key exactly. If you get multiple matches you are doing something else.
>>> dict1 = {"M":1, "E":2, "A":3, "DF":4, "DW":5, "DL":6, "DT":7, "B":10, "XTEXT":4105, "MDEDATA":4110}
>>> dict1["XTEXT"]
4105
Now from the updated question I see you are not really looking for exact matches, as DW1
in file should match DW
in Datatype
. If you want to check if the beginning of the string matches, use str.startswith
:
>>> "DW1".startswith("DW")
True
Note that you must check MDEDATA
before M
to avoid a false match. More generally, check longer matches before shorter ones.
Upvotes: 1