Jacob
Jacob

Reputation: 153

RegEx to compare any character between 59 tabs

I recently received a tab separated file that has 60 fields. Each field can have any character in it. The export I received also has linefeeds and carriage returns in some of the fields. This is causing the tab separated file to not import correctly. Is there a way to remove linebreaks and carriage returns if the line does not have 59 tabs on it? There may or may not be data between each tab.

Sample File

Line 3,4,5 is the issue I'm trying to fix.

Upvotes: 0

Views: 150

Answers (3)

Toto
Toto

Reputation: 91498

I'd do:

Find what: ^((?:[^\t]*\t[^\t]*){1,58})[\r\n]+
Replace with: $1

This will replace line break with nothing if there are less than 59 occurrence of \t character in a line.

Upvotes: 0

Dane
Dane

Reputation: 1211

Warning: I'm assuming that there are no tabs within a column's data. If there is, then you need something far more capable that what I have here.

The following works with the sample input provided:

First, replace all of the line breaks with a character that doesn't occur anywhere in your file. You can even use characters that you can't type with your keyboard.

Find what: (\r\n?|\n)

Replace with: \xB6

Then, match your 60-field rows and give them line-breaks (I'm going with Windows-style):

Find what: ^(([^\t]*\t){59}[^\t\xB6]*)\xB6

Replace with: $1\r\n

I'm making one huge assumption here: that column 60 never contains a line break. If this is false, then you're going to have some of column 60's data ending up in column 1 of the next record.

Now, if you don't like that paragraph symbol showing up in your data, you can either purge it or replace it with whatever you like:

Find what: \xB6

Replace with:

Explanation of matching patterns:

(\r\n?|\n) matches any of the three kinds of line breaks, which are single \r, a single \n, or the Windows-style \r\n. Wikipedia has a whole article about this.

See http://regex101.com/r/iB6fK9 to explore the ^(([^\t]*\t){59}[^\t\xB6]*)\xB6 pattern.

  • I'm matching the beginning of the line with ^ at the start.
  • I have a group of zero or more characters that are not a tab, followed by a tab, that I match exactly 59 times with ([^\t]*\t){59}. That gets us the first 59 tab-separated columns. Only column 59 is captured by this group.
  • For column 60, I match zero or more characters that are neither a tab nor our special character with [^\t\xB6]*.
  • I capture the 60 columns with parentheses, but I leave our special character outside of the captured group so that it gets replaced with the \r\n that we insert with the $1\r\n replacement.

Upvotes: 1

Federico Piazza
Federico Piazza

Reputation: 31045

What I understand from your question is that you want to remove the windows \r\n from your file, to do this you can use replace dialog ctrl+h.

On the Search Mode select Extended (\n, \r,..., then on the "Find What" look for \r\n and in "Replace" leave it empty (or replace it with what you want).

Upvotes: 0

Related Questions