Germstorm

Reputation: 9849

Extracting content from text files with generic rules

I have a lot of text data with varying structure, and I need to extract parts of these texts based on text-based rules. I would use regular expressions, but unfortunately the people who use the application have never heard of them.

Basically, the app does the following:

  1. Load the data into a textbox
  2. Type the structure of the output as a simple set of rules into another textbox
  3. Receive the results in a 3rd textbox

Examples of data structures (I have megabytes of this data):

Label1: value1, measurement
Label2; value2; something else
Nr, value3 (comment)
...

I need some other approach that I could use instead of regular expressions. It can be extremely simple because all I need is one value from every row.

From the example above I have to obtain the following structure:

"value1, value2, value3"

Is there a simpler alternative to regex? Did someone already implement something like this?

I can also imagine that I am approaching the problem from the wrong angle by forcing a non-technical user to write data extraction rules. In that case the question becomes something more generic, like "How can I build an application that lets a non-technical user extract data from separate texts?"

Edit: I have implemented the simplest possible matching for them:

File content:

"Strain at break Ax2";"Unknown"
"Strain at break Ax1";"Unknown"
"Strain at break";"Unknown"
"Yield point strain";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"End test phase Yield point";1;"%"
"Maximum tensile force";5.22647;"kN"

Pattern:

"Tensile strength";(?<value>[^;\n]*);
"Maximum tensile force";(?<value>[^;\n]*);

Still too complex. The problem is that if I start replacing the ugly part with another string to obtain, for example:

"Tensile strength", [First value after]

I lose all the generic nature of the extraction, because every file looks different from this one.

Upvotes: 0

Views: 444

Answers (2)

Germstorm

Reputation: 9849

I solved the issue by defining the rules as regular expressions. Once the rules were defined, I wrapped them in a rule set that was easier for the users to read.

For example, to extract a value from the line

Maximum amount of Sheet Drawing Force= 35.659695[kN]

I defined the regular expression

{0}=\s*(?<value>[^[\n\r]*)

and then let the user supply the name of the field. The {0} placeholder was replaced with the field name, and the resulting regular expression was applied.
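
A minimal sketch of this placeholder substitution, written in Python purely for illustration (the original application's language is not stated here). The names RULE_TEMPLATE and extract are hypothetical. Two details differ from the pattern above: Python spells the named group (?P<value>...) rather than (?<value>...), and the "[" inside the character class is escaped. The call to re.escape is an added assumption so that characters in a user-supplied field name are matched literally.

```python
import re

# Hypothetical template mirroring the rule above. Python uses (?P<value>...)
# for named groups, and the "[" inside the class is escaped for clarity.
RULE_TEMPLATE = r"{0}=\s*(?P<value>[^\[\n\r]*)"


def extract(field_name, text):
    """Return what follows 'field_name=' up to the next '[' or line end, or None."""
    # re.escape (an assumption, not part of the original rule) stops characters
    # in the field name from being interpreted as regex syntax.
    pattern = RULE_TEMPLATE.format(re.escape(field_name))
    match = re.search(pattern, text)
    return match.group("value").strip() if match else None


line = "Maximum amount of Sheet Drawing Force= 35.659695[kN]"
print(extract("Maximum amount of Sheet Drawing Force", line))
```

With this arrangement the user only ever types the field name; the regular expression stays hidden inside the template.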

Upvotes: 0

M.Babcock

Reputation: 18965

Take a look at the FileHelpers library. It allows runtime generation of file layouts, and I think the class that would help in your example is DelimitedClassBuilder.

In your case, I'd probably use FileHelpers to parse the record definitions into the DelimitedClassBuilder and then use the result to parse your records.
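
FileHelpers is a .NET library, and the sketch below does not use its API. As a rough stand-in for the same idea (treating each line as a delimited record and picking a field by position), it uses Python's standard csv module on a few rows copied from the question. The names SAMPLE and values_for are hypothetical, and the sketch assumes the wanted value is always the field right after the label.

```python
import csv
import io

# A few rows copied from the question's file content.
SAMPLE = '''"Strain at break";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"Maximum tensile force";5.22647;"kN"
'''


def values_for(labels, text, delimiter=";"):
    """Map each requested label to the field that follows it on its row."""
    wanted = set(labels)
    found = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # row[0] is the label; row[1] is the first value after it.
        if len(row) >= 2 and row[0] in wanted:
            found[row[0]] = row[1]
    return found


labels = ["Tensile strength", "Maximum tensile force"]
result = values_for(labels, SAMPLE)
print(", ".join(result[name] for name in labels))
```

Here the user-facing rule shrinks to a delimiter and a list of labels, at the cost of only handling files that really are delimited records.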

Upvotes: 1
