Reputation:
I'm trying to learn C#, coming from a Python/PHP background, and I'm porting a script from Python to get started.
The script reads a text file line by line (about 150K lines), applies a list of regexes until one matches, gets the named-group results, and adds the values as properties of a class.
Here's what the data looks like (each line starting with 'No.' is the beginning of a new record):
No.813177294 09/01/1987 150
Tit.INCAL INDÚSTRIA DE CALÇADOS LTDA (BR/PE)
*PARÁGRAFO ÚNICO DO ART. 162 DA LPI.
Procurador: ROBERTO C. FREIRE
No.901699870 02/06/2009 LD6
*Exigência Formal não respondida, Pedido de Registro de Marca considerado inexistente, de acordo com o Art. 157 da LPI
No.830009817 12/12/2008 003
Tit.BIOLAB SANUS FARMACÊUTICA LTDA. (BR/SP)
C.N.P.J./C.I.C./NºINPI : 49475833000106
Apres.: Nominativa ; Nat.: De Produto
Marca: ENXUG
NCL(9) 05 medicamentos para uso humano; preparações farmacêuticas; diuréticos, analgésicos; anestésicos; anti-helmínticos; antibióticos; hormônios para uso medicinal.
Procurador: CRUZEIRO/NEWMARC PATENTES E MARCAS LTDA
And here's what the regexes look like:
regexp = {
    # No.123456789 13/12/2008 560
    # No.123456789 13/12/2008 560
    # No.123456789 13/12/2008 560
    # No.123456789 560
    'number': re.compile(r'No.(?P<Number>[\d]{9}) +((?P<Date>[\d]{2}/[\d]{2}/[\d]{4}) +)?(?P<Code>.*)'),
    # NCL(7) 25 no no no no no ; no no no no no no; *nonono no non o nono
    # NCL(9) 25 no no no no no ; no no no no no no; *nonono no non o nono
    'ncl': re.compile(r'NCL\([\d]{1}\) (?P<Ncl>[\d]{2})( (?P<Especification>.*))?'),
    'doc': re.compile(r'C.N.P.J./C.I.C./NºINPI : (?P<Document>.*)'),
    'description': re.compile(r'\*(?P<Description>.*)'),
    ...
}
1) Can I use the same concept, applying each entry of a Dictionary<string, Regex> to each line until one matches?
2) If I do, is there a way to get a Dictionary<string, string> of the named-group results? (At this stage I can treat everything as a string.)
3) Suppose I have a class like this...
class Record
{
    public string Number { get; set; }
    public string Date { get; set; }
    public string Code { get; set; }
    public string Ncl { get; set; }
    public string Especification { get; set; }
    public string Document { get; set; }
    public string Description { get; set; }
}
...is there a way to set the properties with the values of the named groups?
4) Am I totally missing the point here, trying to code in a statically typed language while still thinking in a dynamically typed one? If so, what can I do?
Sorry for this somewhat lengthy question. I really tried to summarize it to make it shorter :-)
Thanks in advance.
Upvotes: 1
Views: 1470
Reputation: 136603
If you specify named groups like (?<first>group)(?'second'group), the returned Match object will support retrieval by name, as in the snippet below (note that .NET does not accept Python's (?P<name>...) form). You can build yourself a dictionary from this object or pass the Match object directly:
var match = Regex.Match("subject", "regex");
var matchedText = match.Groups["first"].Value;
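One way to build such a dictionary is to iterate over Regex.GetGroupNames(). This is a minimal sketch, not a definitive implementation; the NamedGroups class and its ToDictionary method are hypothetical names, not part of .NET:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroups
{
    // Hypothetical helper: copies every named group that took part in the
    // match into a Dictionary<string, string>.
    public static Dictionary<string, string> ToDictionary(Regex regex, Match match)
    {
        var result = new Dictionary<string, string>();
        foreach (string name in regex.GetGroupNames())
        {
            // Unnamed groups are reported by number ("0", "1", ...); skip them.
            if (int.TryParse(name, out _)) continue;

            Group group = match.Groups[name];
            // Optional groups that did not participate are left out.
            if (group.Success) result[name] = group.Value;
        }
        return result;
    }
}
```

Called with the question's 'number' expression (rewritten with (?<Name>...) groups) and the line "No.813177294 09/01/1987 150", this should yield the keys Number, Date and Code; on a line without a date, the Date key is simply absent.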
A static factory method such as Record Record.Parse(namedValueCollection) would be a way to do it.
Upvotes: 2
Reputation: 11702
Dictionary<string, string> dic_test = new Dictionary<string, string>();
dic_test.Add(key, value);
Upvotes: 0
Reputation: 415600
What you're looking for sounds doable. Of course you'll want to look at System.Text.RegularExpressions, specifically the Regex type there.
Additionally, I'm really fond of the iterator pattern for reading lines from a file:
// requires: using System.Collections.Generic; using System.IO;
public static IEnumerable<string> ReadLines(string path)
{
    using (var sr = new StreamReader(path))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
You start with that base code (which you can re-use almost everywhere) and call it in this method:
// requires: using System.Linq; using System.Text.RegularExpressions;
public static IEnumerable<Record> ReadRecords(string path)
{
    // .NET named groups are written (?<Name>...), not Python's (?P<Name>...)
    IEnumerable<Regex> expressions = new List<Regex>
    {
        new Regex(@"No.(?<Number>[\d]{9}) +((?<Date>[\d]{2}/[\d]{2}/[\d]{4}) +)?(?<Code>.*)"),
        new Regex(@"NCL\([\d]{1}\) (?<Ncl>[\d]{2})( (?<Especification>.*))?"),
        new Regex(@"C.N.P.J./C.I.C./NºINPI : (?<Document>.*)")
    };

    foreach (MatchCollection matches
        in ReadLines(path)
            // FirstOrDefault yields null, rather than throwing, when no expression matches the line
            .Select(s => expressions.FirstOrDefault(e => e.IsMatch(s))?.Matches(s))
            .Where(m => m != null && m.Count > 0))
    {
        yield return Record.FromExpressionMatches(matches);
    }
}
Finish it up by adding a static factory method to your Record class that accepts a MatchCollection parameter. The one thing that looks to be missing here is that you expect to hit each of the expressions once before completing a single record, so that part will work a little differently. Hopefully this gives you enough to get going.
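One way to handle records that span several lines is to keep a "current" record, start a new one whenever a 'No.' line appears, and fill in properties as later lines match. The following is only a sketch under assumptions: the RecordReader class is a hypothetical name, the patterns are the question's expressions rewritten in .NET syntax with escaped dots and a leading ^ anchor added, and lines that match nothing (such as 'Tit.' or 'Procurador:') are ignored.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Record
{
    public string Number { get; set; }
    public string Date { get; set; }
    public string Code { get; set; }
    public string Ncl { get; set; }
    public string Especification { get; set; }
    public string Document { get; set; }
    public string Description { get; set; }
}

public static class RecordReader
{
    // Assumed patterns: anchored, escaped variants of the question's regexes.
    static readonly Regex NumberLine = new Regex(
        @"^No\.(?<Number>\d{9}) +((?<Date>\d{2}/\d{2}/\d{4}) +)?(?<Code>.*)");
    static readonly Regex NclLine = new Regex(
        @"^NCL\(\d\) (?<Ncl>\d{2})( (?<Especification>.*))?");
    static readonly Regex DocLine = new Regex(
        @"^C\.N\.P\.J\./C\.I\.C\./NºINPI : (?<Document>.*)");
    static readonly Regex DescriptionLine = new Regex(
        @"^\*(?<Description>.*)");

    public static IEnumerable<Record> ReadRecords(IEnumerable<string> lines)
    {
        Record current = null;
        foreach (string line in lines)
        {
            Match m;
            if ((m = NumberLine.Match(line)).Success)
            {
                // A 'No.' line closes the previous record and opens a new one.
                if (current != null) yield return current;
                current = new Record
                {
                    Number = m.Groups["Number"].Value,
                    Code = m.Groups["Code"].Value
                };
                if (m.Groups["Date"].Success) current.Date = m.Groups["Date"].Value;
            }
            else if (current == null)
            {
                continue; // ignore anything before the first 'No.' line
            }
            else if ((m = NclLine.Match(line)).Success)
            {
                current.Ncl = m.Groups["Ncl"].Value;
                if (m.Groups["Especification"].Success)
                    current.Especification = m.Groups["Especification"].Value;
            }
            else if ((m = DocLine.Match(line)).Success)
            {
                current.Document = m.Groups["Document"].Value;
            }
            else if ((m = DescriptionLine.Match(line)).Success)
            {
                current.Description = m.Groups["Description"].Value;
            }
        }
        // Emit the last record once the input is exhausted.
        if (current != null) yield return current;
    }
}
```

Because the method takes any IEnumerable<string>, it can be fed from a line-by-line file iterator or from an in-memory array while experimenting.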
Upvotes: 1
Reputation: 4193
If you really want to learn C#, you should ask only for references rather than full answers, such as this one (Regex class), but I'm sure you can find much more information with a quick Google search too.
Upvotes: 1
Reputation: 881487
1., sure
2., see e.g. here
3., yep, same basic concept as 2
4., nah, C# is flexible enough to allow you to port your architecture over
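As an illustration of point 3, one possible approach is reflection: look up a property with the same name as each named group and set it. This is a hedged sketch rather than a recommended design; the GroupBinder class and its Apply method are hypothetical names, and the Record class repeats the one from the question.

```csharp
using System.Reflection;
using System.Text.RegularExpressions;

public class Record
{
    public string Number { get; set; }
    public string Date { get; set; }
    public string Code { get; set; }
    public string Ncl { get; set; }
    public string Especification { get; set; }
    public string Document { get; set; }
    public string Description { get; set; }
}

public static class GroupBinder
{
    // Hypothetical helper: for every group that matched, set the writable
    // public string property of the same name on target, if there is one.
    public static void Apply(object target, Regex regex, Match match)
    {
        foreach (string name in regex.GetGroupNames())
        {
            Group group = match.Groups[name];
            if (!group.Success) continue;

            // Numbered groups ("0", "1", ...) have no matching property and are skipped.
            PropertyInfo property = target.GetType().GetProperty(name);
            if (property != null && property.CanWrite && property.PropertyType == typeof(string))
            {
                property.SetValue(target, group.Value, null);
            }
        }
    }
}
```

Reflection keeps the Python-style "group name equals attribute name" convention, at the cost of losing compile-time checking; assigning the properties explicitly is the more conventional statically typed alternative.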
Also consider studying this book as the best intro to .NET for Python programmers AND vice versa (I'm biased, having been a tech editor and being a friend of the author, but I think this is objectively defensible;-).
Upvotes: 3
Reputation: 65466
Sorry this is not a specific answer, but could you use IronPython to convert your scripts to run under the CLR and then step to C#?
Upvotes: 1