Reputation: 171
I have a .txt file with 6,000,000 rows. There are 140,000 rows I want to scrape. I'm using Notepad++ instead of regex101 because there are too many rows to scrape. The whole document looks like this:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"
Sender: nick <[email protected]>
Message: Thats my message**
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"
Sender: another-nick <[email protected]>
Message: Another message
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="UTF-8"
Of course, it doesn't look exactly like that. The rows that aren't "Sender" and "Message" are somewhat random. I want to extract every email in a "Sender" row and the message on the row below it. I want to pair each message with its email, so I need each sender together with his message.
For example:
email1 - his message
email2 - his message
email3 - his message
OK, it seems pretty easy, right? The problem is that when I search for:
Sender: .+ <.+>
it gives me 140,000 rows.
But when I search for:
Message: .+
it gives me 139,094 rows. I tried to find the "broken rows" with this:
^(?!Sender: .+ <.+>)\r\n\Message: .+)
But that doesn't work. I think my coding skills are not good enough. I just don't know where I made the mistake.
I also tried to find "good" rows with:
Sender: .+ <.+>\r\n\Message: .+
And it works properly. But I don't know how to extract the matches. I added a bookmark to every match, and it looks like this:
http://puu.sh/nL6n4/3f6331b16b.png
And now, when I click "Search -> Bookmark -> Copy Bookmarked Lines", I get only:
Sender: nick <[email protected]>
Sender: another-nick <[email protected]>
Without the messages. I'm so tired of it. Can somebody help me with that?
Upvotes: 0
Views: 82
Reputation: 8413
I hope I understood your question correctly. Here is how I would do it:
Open the file in Notepad++, then press Ctrl+F to open the search dialog and switch to the "Mark" tab. Then check "Bookmark line" and activate "Regular expression".
The first regular expression to search for is:
Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*)
This will bookmark all the lines starting with Sender that are followed by a Message line.
However, this doesn't bookmark the Message line, as Notepad++ doesn't support this, but we can work around it with a second mark search. Now the regular expression is:
Sender:[^<\r\n]*<([^\r\n]*)>\r?\n\KMessage:\s*([^\r\n]*)
Note the \K, which resets the start of the match so that the match begins on the Message line. Now the Message lines are bookmarked as well.
Go to Search -> Bookmark -> Remove Unmarked Lines so that only your Sender and Message lines are left.
Now it's time for a replace, again using a regular expression. Search for:
Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:\s*([^\r\n]*)
and replace it with:
$1 - $2
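If Notepad++ struggles with a file of this size, the same steps can be sketched in a short Python script. This is only a rough sketch, not part of the Notepad++ workflow above. The function names extract_pairs and unpaired_senders, the file name in the usage comment, and the example addresses are made up for illustration. The pattern is based on the one above, with one deliberate change: \s* after "Message:" is replaced by [ \t]*, because \s can also match a line break, so an empty message would otherwise pull in the following line. The second function lists Sender lines that are not directly followed by a Message line, which may help explain the gap between 140,000 and 139,094.

```python
import re

# Based on the search pattern above. Assumption: [ \t]* is used instead of \s*
# so that an empty "Message:" line cannot swallow the next line.
PAIR = re.compile(r"Sender:[^<\r\n]*<([^\r\n]*)>\r?\nMessage:[ \t]*([^\r\n]*)")


def extract_pairs(text):
    """Return 'email - message' for every Sender line directly followed by a Message line."""
    return [f"{m.group(1)} - {m.group(2)}" for m in PAIR.finditer(text)]


def unpaired_senders(text):
    """Return (line_number, line) for Sender lines whose next line is not a Message line."""
    lines = re.split(r"\r?\n", text)
    result = []
    for i, line in enumerate(lines):
        if line.startswith("Sender:"):
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if not following.startswith("Message:"):
                result.append((i + 1, line))  # 1-based line number
    return result


# Hypothetical usage ("mails.txt" is a placeholder file name):
# with open("mails.txt", encoding="utf-8", newline="") as f:
#     text = f.read()
# print("\n".join(extract_pairs(text)))
# for number, line in unpaired_senders(text):
#     print(number, line)
```

The script reads the whole file into memory, which should be fine for a few million short lines on a typical machine, but that is an assumption about your data.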
Upvotes: 1