royco

Reputation: 5529

How to split a user-generated string that may contain the delimiter?

I'd like to String.Split() the following string using a comma as the delimiter:

John,Smith,123 Main Street,212-555-1212

The above content is entered by a user. If they enter a comma in their address, the resulting string causes problems for String.Split(), since you now have 5 fields instead of 4:

John,Smith,123 Main Street, Apt 101,212-555-1212

I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:

value = value.Replace(",", "*");  

However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.

I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?

EDIT: This is called delimiter collision.

Upvotes: 1

Views: 504

Answers (9)

richardtallent

Reputation: 35404

In a sense, the user is already "escaping" the comma with the space afterward.

So, try this:

string[] values = Regex.Split(value, ",(?![ ])");

The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.

One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. The user can still break it if they are lazy, though what they'll actually break is the fields after Address 2.

Upvotes: 0

Daniel Fortunov

Reputation: 44423

This is a common scenario: you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in the structure around them.

You have several options:

  1. Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallowing commas.
  2. Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output).
  3. Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.

If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
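As a rough illustration of option 3, here is a minimal sketch of a backslash-escaping convention. The class name EscapedFields and the choice of backslash as the escape character are assumptions made for this example, not an established standard; a real CSV library would use quoting instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: names and the backslash convention are illustrative only.
public static class EscapedFields
{
    private const char Delimiter = ',';
    private const char Escape = '\\';

    // Prefix every delimiter or escape character in a value with a backslash.
    public static string EscapeValue(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c == Delimiter || c == Escape) sb.Append(Escape);
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Escape each value, then join with the delimiter.
    public static string Join(IEnumerable<string> values)
    {
        return string.Join(Delimiter.ToString(), values.Select(v => EscapeValue(v)));
    }

    // Split on unescaped delimiters and undo the escaping in the same pass.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == Escape && i + 1 < line.Length)
            {
                current.Append(line[++i]); // take the next character literally
            }
            else if (c == Delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

Because the escape character is itself escaped, any user input round-trips, including input that already contains backslashes.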

Upvotes: 4

disjunction

Reputation: 656

Funny solution (works if the address is the only field that can contain a comma):

Split the string by comma. The first two pieces will be the first name and last name; the last piece is the telephone number. Take those away, then join the rest back together with commas: that is the address ;)
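A minimal sketch of that idea, assuming the four-field layout from the question (first name, last name, address, phone). The class and method names are made up for this example.

```csharp
using System;

// Hypothetical helper: assumes only the address (third field) may contain commas.
public static class AddressInTheMiddle
{
    // Returns { first name, last name, address, phone }.
    public static string[] Parse(string line)
    {
        string[] parts = line.Split(',');
        if (parts.Length < 4)
            throw new FormatException("Expected at least 4 comma-separated pieces.");

        // Everything between the first two pieces and the last piece is the address.
        string address = string.Join(",", parts, 2, parts.Length - 3);
        return new[] { parts[0], parts[1], address, parts[parts.Length - 1] };
    }
}
```

This breaks as soon as a second field is allowed to contain commas, since there is then no way to tell where the address starts or ends.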

Upvotes: 0

Matt Wrock

Reputation: 6640

One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
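A small sketch of what that could look like, assuming each field is encoded separately as UTF-8 and then Base64. The class name Base64Fields is invented for this example.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: encodes each field so the comma can never appear inside one.
public static class Base64Fields
{
    // Base64-encode each value (as UTF-8 bytes) and join with commas.
    public static string Join(params string[] values)
    {
        return string.Join(",",
            values.Select(v => Convert.ToBase64String(Encoding.UTF8.GetBytes(v))));
    }

    // Split on commas and decode each piece back to the original text.
    public static string[] Split(string line)
    {
        return line.Split(',')
            .Select(p => Encoding.UTF8.GetString(Convert.FromBase64String(p)))
            .ToArray();
    }
}
```

The trade-off is that the stored string is no longer human-readable and is roughly a third longer than the raw input.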

Upvotes: 2

Amit

Reputation: 1057

Don't allow the user to enter the character that you are using as a delimiter. I personally feel this is the best way.

Upvotes: 0

Robert Harvey

Reputation: 180908

If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.

John,Smith,"123 Main Street, Apt. 6",212-555-1212
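To show what such a parser does with the quoted line above, here is a deliberately minimal hand-rolled sketch. It assumes a single line, comma delimiters, and doubled quotes ("") for a literal quote inside a quoted field; the class name QuotedCsv is invented, and an existing CSV library is likely the better choice in real code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical minimal parser: single line only, no embedded newlines.
public static class QuotedCsv
{
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote is a literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas in here are plain text
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

When writing the line out, the matching step is to wrap any value containing a comma or quote in quotes and double its embedded quotes.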

Upvotes: 3

gary

Reputation: 521

Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?

The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.

Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).

Upvotes: -1

Dan McClain

Reputation: 11920

You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.

This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.

The best solution is to define the valid characters and reject invalid characters somehow, then use an invalid character (which will not appear in the input, since it is "banned") as your delimiter.

Upvotes: 0

NJE

Reputation: 769

This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
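A minimal sketch of that approach, assuming validation happens before the fields are joined. The class name PipeDelimited and the decision to throw on invalid input (rather than strip the character) are choices made for this example.

```csharp
using System;

// Hypothetical helper: rejects any field containing the reserved delimiter.
public static class PipeDelimited
{
    private const char Delimiter = '|';

    // True when the value does not contain the reserved delimiter.
    public static bool IsValidField(string value)
    {
        return value.IndexOf(Delimiter) < 0;
    }

    // Validate every field, then join with the pipe.
    public static string Join(params string[] values)
    {
        foreach (string v in values)
        {
            if (!IsValidField(v))
                throw new ArgumentException("Fields must not contain '|'.");
        }
        return string.Join(Delimiter.ToString(), values);
    }

    // Safe to split naively, since no field can contain the delimiter.
    public static string[] Split(string line)
    {
        return line.Split(Delimiter);
    }
}
```

In a UI, IsValidField would typically drive a validation message on the input form so the user can correct the value before it is stored.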

Upvotes: 3
