Saadevni
Saadevni

Reputation: 9

How do I get the list of the longest common substrings with a minimum length in a Pandas DataFrame column of strings that is over 10000 entries?

So I have a pandas DataFrame where there are over 50000 rows. In one column, there is a list of strings that has no spaces, and is converted to lowercase, and there are multiple junk characters. The column value represents a message from an internal messaging system, which has a specific format.

There is a base assumption that there is no spelling error in certain common words like "Please", "Regards", "Thanks", "Reference", etc.

I have managed to clean out the message and get just the body, and in the message body I have been trying to find a list of the longest common substrings between each and every row.

That is, a string of characters (representing words) that were repeated in a lot of messages.

There is no length limit on the message column, so messages could go upto 10000 characters.

The condition is that, the matched substring has to be a minimum of 20 characters, so as to filter out any reference codes that could be repeated in multiple messages.

For an example of what a common substring could probably look like, this is the output I am expecting (column names are for reference only, not the actual names I will be using):

String No of matches Index of first match (0-indexed)
wedonothaveanydetailsonthereferenceprovided 102 23434
wehavereceivedyourmessageandweacknowledge 80 28362
thereisnodataprovidedintheinitialmessage 34 2312

etc.

For context, this is how the original message would look like:

Received message form

The data has the message in the format given below:

-attentionwithregardstoyourmessagedated01/03/2013wehavereceivedyourmessageandweacknowledgeunfortunatelyourdeptdoesnthaveanydetailsonthespecificreferenceyouhavesentpleaseprovideuswithmoredetailsonthepartiesinvolvedortheticketnumberregardssupportdept

Representing an initial message of

2-ref/02/03/2013/492873
3-orig/01/03/2013/294715
10-Attention
With regards to your message dated 01/03/2013
we have received your message, and we acknowledge.
Unfortunately, our dept. doesn't have any details on the specific reference you have sent.
Please provide us with more details on the parties involved or the ticket number.- Regards, Support Dept.
13-end

Note - the message data is not in the above format, the above format is for reference only.

The only solutions that I could find for this problem were all in O(n^2) time, which is very slow for the amount of data that I'm working on.

Also, there was no way to implement the minimum match length condition.

I had tried implementing suffix trees, but I faced difficulty adapting them for a Pandas DataFrame column, and scaling them up for 10000 rows.

Is there any algorithm or pseudocode that works for a Pandas DataFrame with over 10000 rows, that could provide me a list of strings like in the output above? I'm open to suggestions that could point me in the right direction.

It would also be helpful if there was a way to implement the suffix tree for the DataFrame, since it looked like the fastest solution to this problem (O(n) time).

Thanks!

Upvotes: 0

Views: 80

Answers (0)

Related Questions