Reputation: 117
I have the following data, for example:
HRB 760468: CANNSITE GmbH, Stuttgart, c/o Dr. Elvira Ehle, Rotdornweg 7, 18119 Rostock. Gesellschaft mit beschränkter Haftung. Gesellschaftsvertrag vom 09.03.2017.
HRB 760481: Neckarsee 399. V V GmbH, Stuttgart, Kurt-Schumacher-Straße 18-20, 53113 Bonn. Gesellschaft mit beschränkter Haftung. Gesellschaftsvertrag vom 22.03.2017.
I need to extract:
HRB 760468: CANNSITE GmbH, Stuttgart, c/o Dr. Elvira Ehle, Rotdornweg 7, 18119 Rostock
HRB 760481: Neckarsee 399. V V GmbH, Stuttgart, Kurt-Schumacher-Straße 18-20, 53113 Bonn
My RegEx is: @"HRB.\d+:[^.]+"
The problem is the case "Dr. Elvira": because it contains a ".", the regex stops there instead of right before "Gesellschaft mit". I can't figure out how to change the regex so that it matches up to "Rostock"/"Bonn" at the end.
After that, I try to extract "760468", "CANNSITE GmbH", "Stuttgart", "Rotdornweg 7", and "18119". For that I'm doing the following (the order follows the data above); maybe you can help me improve it:
Regex regexNummer = new Regex(@"\d+:");
Regex regexFirma = new Regex(@":[^,]+");
Regex regexStadt = new Regex(@", \w+.\w+.\w+.\w+,");
Regex regexAdresse = new Regex(@", \w+.+\d,");
Regex regexPlz = new Regex(@", \d+[^ ]+");
string nummer = regexNummer.Match(match.Value).ToString().Replace(":", "");
string firma = regexFirma.Match(match.Value).ToString().Replace(": ", "");
string plz = regexPlz.Match(match.Value).ToString().Replace(", ", "");
string stadt = regexStadt.Match(match.Value).ToString().Replace(", ", "");
stadt = stadt.Replace(",", "");
string adresse = regexAdresse.Match(match.Value).ToString();
adresse = adresse.Remove(adresse.Length - 1);
adresse = adresse.Substring(adresse.LastIndexOf(", ") + 1);
adresse = adresse.Substring(1);
Because the addresses come in so many different formats, this often breaks.
Upvotes: 1
Views: 91
Reputation: 4394
Seems like you have some kind of zip code before the city name. You can potentially exploit that for your regex.
The regex below extracts the first portion from both of your examples.
Regex.Match(txt, @"(^HRB .*?\d{5}\s+\S+\.)")
EDIT:
Modified the regex to also work with the text below:
HRB 760467: APC One UG (haftungsbeschränkt), Rottenburg am Neckar, Lilienthalweg 3, 72108 Rottenburg am Neckar. Gesellschaft mit beschränkter Haftung. Gesellschaftsvertrag vom 22.03.2017. Geschäftsanschrift: Lilienthalweg 3, 72108 Rottenburg am Neckar. Gegenstand: Entwicklung, Entwicklungsberatung, Herstellung sowie Vertrieb von elektronischen Produkten. Stammkapital: 1.500,00 EUR.
Regex.Match(txt, @"(<br>HRB .*?\d{5}\s+[\w\-\s]+\.)")
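The same postal-code anchor can also be used to pull out the individual fields in one pass with named groups. The following is only a sketch under a few assumptions that the three sample entries satisfy but real data may not: each entry starts with "HRB <number>:", the address ends with ", <five-digit postal code> <city>.", the city contains no period, and the company name contains no comma. `ParseEntry` is a hypothetical helper name, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns { nummer, firma, sitz, strasse, plz, ort },
// or an empty array if the entry does not fit the assumed layout.
static string[] ParseEntry(string entry)
{
    // Lazy ".*?" stops at the FIRST ", ddddd City." so periods in "Dr." or "399." do no harm.
    Match m = Regex.Match(entry,
        @"^HRB (?<nummer>\d+): (?<rest>.*?), (?<plz>\d{5}) (?<ort>[^.]+)\.");
    if (!m.Success) return Array.Empty<string>();

    // Everything between the colon and the postal code, split into comma-separated fields.
    string[] fields = m.Groups["rest"].Value.Split(new[] { ", " }, StringSplitOptions.None);
    if (fields.Length < 3) return Array.Empty<string>();

    return new[]
    {
        m.Groups["nummer"].Value,
        fields[0],                   // company (assumed to contain no comma)
        fields[1],                   // registered seat
        fields[fields.Length - 1],   // street is the last field before the postal code
        m.Groups["plz"].Value,
        m.Groups["ort"].Value
    };
}

string[] parts = ParseEntry(
    "HRB 760468: CANNSITE GmbH, Stuttgart, c/o Dr. Elvira Ehle, Rotdornweg 7, 18119 Rostock. " +
    "Gesellschaft mit beschränkter Haftung. Gesellschaftsvertrag vom 09.03.2017.");
Console.WriteLine(string.Join(" | ", parts));
// prints: 760468 | CANNSITE GmbH | Stuttgart | Rotdornweg 7 | 18119 | Rostock
```

Optional blocks such as "c/o Dr. Elvira Ehle" end up in the middle of `fields` and are simply skipped here; if you need them, take the fields between index 1 and the last one.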
Upvotes: 2
Reputation: 8336
Maybe regex isn't the right tool? Split on commas and parse each block of comma-delimited text. Then maybe, just maybe, you can determine what is in each block with a targeted regex that tells you whether that substring is of a given type. I still don't know how to handle the case where multiple patterns match.
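A rough sketch of that split-and-classify idea might look like the following. The labels, the rules, and the `Classify` helper are all assumptions made up for illustration; the rules are checked in a fixed order, so when several patterns could match, the first one wins, which is one simple way around the ambiguity.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical classifier: labels one comma-delimited block. First matching rule wins.
static string Classify(string block)
{
    if (Regex.IsMatch(block, @"^HRB \d+:")) return "register+company";
    if (Regex.IsMatch(block, @"^\d{5} ")) return "postal code+city";
    if (block.StartsWith("c/o ", StringComparison.Ordinal)) return "care-of";
    // Assumption: a street block ends with a house number such as "7", "18-20" or "12a".
    if (Regex.IsMatch(block, @"\d\w?$")) return "street";
    return "other";
}

string entry =
    "HRB 760468: CANNSITE GmbH, Stuttgart, c/o Dr. Elvira Ehle, Rotdornweg 7, 18119 Rostock. " +
    "Gesellschaft mit beschränkter Haftung.";

foreach (string block in entry.Split(new[] { ", " }, StringSplitOptions.None))
{
    string label = Classify(block);
    Console.WriteLine($"{label}: {block}");
    // Blocks after the postal code belong to the free-text remainder, so stop here.
    if (label == "postal code+city") break;
}
```

The postal-code block still carries the trailing sentences ("Rostock. Gesellschaft mit ..."), so the city would have to be cut at the first period in a further step.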
Upvotes: 1
Reputation: 3399
I'm no expert on German addresses, but from the examples you give it appears you just need everything from the "HRB" through the word followed by five digits. In regex,
HRB .+ \d{5} \w+
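One caveat worth checking against your data: `.+` is greedy, so if an entry contains a second postal code later on (as in the longer "Geschäftsanschrift" example in another answer), the match runs to the last one. A lazy `.+?` stops at the first. The snippet below is a small demonstration using a shortened copy of that entry; note also that `\w+` only captures the first word of a multi-word city.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copy of the longer sample entry; it contains the postal code twice.
string text =
    "HRB 760467: APC One UG (haftungsbeschränkt), Rottenburg am Neckar, Lilienthalweg 3, " +
    "72108 Rottenburg am Neckar. Gesellschaft mit beschränkter Haftung. " +
    "Geschäftsanschrift: Lilienthalweg 3, 72108 Rottenburg am Neckar.";

string greedy = Regex.Match(text, @"HRB .+ \d{5} \w+").Value;   // runs to the LAST postal code
string lazy   = Regex.Match(text, @"HRB .+? \d{5} \w+").Value;  // stops at the FIRST postal code

Console.WriteLine(greedy.Contains("Geschäftsanschrift")); // True
Console.WriteLine(lazy.Contains("Geschäftsanschrift"));   // False
Console.WriteLine(lazy);
```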
Upvotes: 1