audiophonic

Reputation: 171

Notepad++ regex: how to use a negation properly

I have a .txt file with 6,000,000 rows, of which I want to scrape 140,000. I'm using Notepad++ instead of regex101 because there are too many rows. The whole document looks like this:

MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"

Sender: nick <[email protected]>
Message: Thats my message

MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"

Sender: another-nick <[email protected]>
Message: Another message

MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"

Of course, it doesn't look exactly like that. The rows that aren't "Sender" or "Message" rows are a little bit random. I want to extract every email address from the "Sender" rows and every message from the row directly below it. The message has to stay paired with its email address, so that I end up with each sender and his message.

For example:

email1 - his message
email2 - his message
email3 - his message

OK, that seems pretty easy, right? The problem is that when I search for:

Sender: .+ <.+> 

it gives me 140,000 rows.

But when I search for:

Message: .+

it gives me 139,094 rows. I tried to find the "broken rows" with this:

^(?!Sender: .+ <.+>)\r\n\Message: .+)

But that does not work. I think my coding skills are not good enough; I just don't know where I made the mistake.

I also tried to find the "good" rows with:

Sender: .+ <.+>\r\n\Message: .+

And that works properly, but I don't know how to extract the result. I bookmarked every match, and it looks like this:

http://puu.sh/nL6n4/3f6331b16b.png

And now, when I click "Search -> Bookmark -> Copy Bookmarked Lines", I get only:

Sender: nick <[email protected]>
Sender: another-nick <[email protected]>

The messages are missing. I'm so tired of this. Can somebody help me with it?

Upvotes: 0

Views: 82

Answers (1)

Sebastian Proske

Reputation: 8413

I hope I understood your question correctly. Here is how I would do it:

Open the file in Notepad++, press Ctrl+F to open the search dialog, and switch to the "Mark" tab. Then check "Bookmark line" and select "Regular expression" as the search mode.

Mark Dialog

The first regular expression to search for is Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*). This bookmarks every Sender line that is directly followed by a Message line.

First Mark
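Since the question mentions 140,000 Sender rows but only 139,094 Message rows, it may help to see which Sender lines the pattern above does not match. A minimal sketch in Python, outside Notepad++, assuming the file fits in memory; the function name unpaired_senders and the example.com addresses are made up for illustration:

```python
import re

# Same pattern as in the answer: group 1 is the e-mail, group 2 the message.
PAIR = re.compile(r"Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*)")
# Any Sender line, whether or not a Message line follows it.
SENDER = re.compile(r"^Sender:[^\r\n]*", re.MULTILINE)


def unpaired_senders(text):
    """Return the Sender lines that are not directly followed by a Message line."""
    paired_starts = {m.start() for m in PAIR.finditer(text)}
    return [m.group(0) for m in SENDER.finditer(text)
            if m.start() not in paired_starts]


# Hypothetical sample data; the addresses are placeholders.
sample = (
    "Sender: nick <nick@example.com>\n"
    "Message: Thats my message\n"
    "\n"
    "Sender: lost <lost@example.com>\n"
    "X-Other: something\n"
    "Sender: another-nick <another@example.com>\n"
    "Message: Another message\n"
)

print(unpaired_senders(sample))
```

Whatever this returns for the real file would be the Sender rows that the mark search skips.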

However, this doesn't bookmark the Message line, because Notepad++ doesn't bookmark the second line of a match. We can work around that with a second mark search, this time using the regular expression Sender:[^<\r\n]*<([^\r\n]*)>\r?\n\KMessage:\s*([^\r\n]*). Note the \K, which resets the start of the match so that it begins at the Message line. After this search, the Message lines are bookmarked as well.

Second Mark

Go to Search -> Bookmark -> Remove Unmarked Lines so that only your Sender and Message lines are left.

Remove Unmarked

Now it's time for a replace, again using the regular expression Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*), with $1 - $2 as the replacement.

Final
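For comparison, the mark, remove and replace steps can be approximated in a few lines of Python using the same regular expression. This is only a sketch under assumptions: the helper name extract_pairs, the file name mails.txt and the example.com addresses are hypothetical, and the whole file is read into memory at once.

```python
import re

# Same pattern as in the answer: group 1 is the e-mail, group 2 the message.
PAIR = re.compile(r"Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*)")


def extract_pairs(text):
    """Return one 'email - message' string per Sender/Message pair."""
    return [f"{m.group(1)} - {m.group(2)}" for m in PAIR.finditer(text)]


# Hypothetical sample data; the addresses are placeholders.
sample = (
    "MIME-Version: 1.0\n"
    "Sender: nick <nick@example.com>\n"
    "Message: Thats my message\n"
    "\n"
    "Content-Transfer-Encoding: 8bit\n"
    "Sender: another-nick <another@example.com>\n"
    "Message: Another message\n"
)

for line in extract_pairs(sample):
    print(line)

# For a real file (hypothetical name), something like:
# with open("mails.txt", encoding="utf-8") as f:
#     pairs = extract_pairs(f.read())
```

Lines that belong to no pair are simply never matched, which plays the role of the "remove unmarked lines" step.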

Upvotes: 1
