alvas
alvas

Reputation: 122082

Fastest way to replace space for underscore for a list of words in text

Given 10,000,000,000 lines of around 20-50 words each line, e.g.:

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .
However , others argue that while anti-statism is central , it is inadequate to define anarchism .
Therefore , they argue instead that anarchism entails opposing authority or hierarchical organization in the conduct of human relations , including , but not limited to , the state system .
Proponents of anarchism , known as " anarchists " , advocate stateless societies based on non - hierarchical free association s. As a subtle and anti-dogmatic philosophy , anarchism draws on many currents of thought and strategy .
Anarchism does not offer a fixed body of doctrine from a single particular world view , instead fluxing and flowing as a philosophy .
There are many types and traditions of anarchism , not all of which are mutually exclusive .
Anarchist schools of thought can differ fundamentally , supporting anything from extreme individualism to complete collectivism .
Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications .
Anarchism is often considered a radical left-wing ideology , and much of anarchist economics and anarchist legal philosophy reflect anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics .
Anarchism as a mass social movement has regularly endured fluctuations in popularity .
The central tendency of anarchism as a social movement has been represented by anarcho-communism and anarcho-syndicalism , with individualist anarchism being primarily a literary phenomenon which nevertheless did have an impact on the bigger currents and individualists have also participated in large anarchist organizations .
Many anarchists oppose all forms of aggression , supporting self-defense or non-violence ( anarcho-pacifism ) , while others have supported the use of some coercive measures , including violent revolution and propaganda of the deed , on the path to an anarchist society .
Etymology and terminology The term derives from the ancient Greek ἄναρχος , anarchos , meaning " without rulers " , from the prefix ἀν - ( an - , " without " ) + ἀρχός ( arkhos , " leader " , from ἀρχή arkhē , " authority , sovereignty , realm , magistracy " ) + - ισμός ( - ismos , from the suffix - ιζειν , - izein " - izing " ) . "
Anarchists " was the term adopted by Maximilien de Robespierre to attack those on the left whom he had used for his own ends during the French Revolution but was determined to get rid of , though among these " anarchists " there were few who exhibited the social revolt characteristics of later anarchists .
There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation , such as William Godwin and Wilhelm Weitling , but they did not use the word " anarchist " or " anarchism " in describing themselves or their beliefs .
Pierre-Joseph Proudhon was the first political philosopher to call himself an anarchist , making the formal birth of anarchism the mid-nineteenth century .
Since the 1890s from France , the term " libertarianism " has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States ; its use as a synonym is still common outside the United States .
On the other hand , some use " libertarianism " to refer to individualistic free-market philosophy only , referring to free-market anarchism as " libertarian anarchism " .

And let's say I have a list of dictionary terms that is made up of one or more words, e.g:

clinical anatomy
clinical psychology
cognitive neuroscience
cognitive psychology
cognitive science
comparative anatomy
comparative psychology
compound morphology
computational linguistics
correlation
cosmetic dentistry
cosmography
cosmology
craniology
craniometry
criminology
cryobiology
cryogenics
cryonics
cryptanalysis
crystallography
curvilinear correlation
cybernetics
cytogenetics
cytology
deixis
demography
dental anatomy
dental surgery
dentistry
philosophy
political philosophy

And I need to find all sentences that contains any of these terms and then replace the spaces between the words within the terms as underscores.

For example, there is this sentence in the text:

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .

And there's the dictionary term political philosophy in the text. So the output for this sentence needs to be:

Anarchism is often defined as a political_philosophy which holds the state to be undesirable , unnecessary , or harmful .

I could do this:

dictionary = sort(dictionary, key=len) # replace the longest terms first.
for line in text:
   for term in dictionary: 
       if term in line:
           line = line.replace(term, term.replace(' ', '_'))

Assuming that I have 10,000 Dictionary terms (D) and 10,000,000,000 Sentences (S), the complexity of using the simple method would be O(D*S), right? Is there a faster and less brute-forcey way to achieve the same results?

Is there a way to replace all of the terms with spaces by the terms with underscore for each line? That will help in avoiding the inner loop.

Would it be more efficient if index the text using something like whoosh first then query the index and replace the terms? I would still need something like a O(1*S) to do the replacements, right?

The solution doesn't need to be in Python, even if it's some Unix command tricks like grep/sed/awk, it's fine, as long as subprocess.Popen-able.

Please correct my complexity assumptions if i'm wrong, pardon my noobiness.


Given a sentence:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics

And let's say i have the dictionary:

cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy 

The output should look like:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics

And the aim is to do this with a text file of 10 billion lines and a dictionary of 10-100k phrases.

Upvotes: 2

Views: 2236

Answers (2)

Padraic Cunningham
Padraic Cunningham

Reputation: 180441

It may be better to split the words, mapping the words from the start of the phrase to the full phrase, if you need the largest, instead of checking every item in the dict you just need to sort the phrases that appear by length:

from collections import defaultdict

def get_phrases(fle):
    phrase_dict = defaultdict(list)
    with open(fle) as ph:
        for line in map(str.rstrip, ph):
            k, _, phr = line.partition(" ")
            phrase_dict[k].append(line)
        return phrase_dict

from itertools import chain


def replace(fle, dct):
    with open(fle) as f:
        for line in f:
            phrases = sorted(chain.from_iterable(dct[word] for word in line.split() 
                             if word in dct) ,reverse=1, key=len)
            for phr in phrases:
                  line = line.replace(phr, phr.replace(" ", "_"))
            yield line

Output:

In [10]: cat out.txt
This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics
In [11]: cat phrases.txt
cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy
In [12]: list(replace("out.txt",get_phrases("phrases.txt")))
Out[12]: ['This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics']

A few other versions:

def repl(x):
    if x:
        return x.group().replace(" ", "_")
    return x


def replace_re(fle, dct):
    with open(fle) as f:
        for line in f:
            spl = set(line.split())
            phrases = chain.from_iterable(dct[word] for word in spl if word in dct)
            line = re.sub("|".join(phrases), repl, line)
            yield line


def replace_re2(fle, dct):
    cached = {}
    with open(fle) as f:
        for line in f:
            phrases = tuple(chain.from_iterable(dct[word] for word in set(line.split()) if word in dct))
            if phrases not in cached:
                r = re.compile("|".join(phrases))
                cached[phrases] = r
                line = r.sub(repl, line)
            else:
                line = cached[phrases].sub(repl, line)
            yield line

Upvotes: 1

user557597
user557597

Reputation:

I would make a regex of your Dictionary to match the data.
Then on the replacement side, use a callback to replace spaces with _.

I estimate it would take less than 3 hours to do the whole thing.

Fortunately there is a Ternary Tool (Dictionary) regex generator.

To generate the regex and for what is shown below, you'll need the Trial
version of RegexFormat 7

Some links:
Screenshot of tool
TernaryTool(Dictionary) - Text version Dictionary samples
A 175,000 word Dictionary Regex

You basically generate your own Dictionary
by dropping in the strings you want to find, then press the Generate button.

Then all you have to do is read in 5 MB chunks and do a find/replace using the
regex, then append it to the new file.. rinse repeat.
Pretty simple really.

Based on your sample (above) this is an estimate of the time it would take
to complete 10 Billion lines.

This analysis is based on using a benchmark that was run on your sample input using the generated regex (below).

19 lines  (@ 3600 chars)

Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   5
Elapsed Time:    4.03 s,   4034.28 ms,   4034278 µs

////////////////////////////
3606 chars
x 50,000
------------
180,300,000  (chars)

or 

20 lines
x 50,000
------------
1,000,000  (lines)
=========================
10,000,000,000 lines
/
1,000,000  (lines) per 4 seconds
-----------------------------------------
40,000 seconds
/
3600 secs per hour
-------------------------
11 hours
////////////////////////////

However, if you read in and process 5 megabyte chunks
(as a single string) it will reduce the engine overhead
and have the time down to 1-3 hours.

This is the generated regex for your sample Dictionary (compressed):

\b(?:c(?:linical[ ](?:anatomy|psychology)|o(?:gnitive[ ](?:neuroscience|psychology|science)|mp(?:arative[ ](?:anatomy|psychology)|ound[ ]morphology|utational[ ]linguistics)|rrelation|sm(?:etic[ ]dentistry|o(?:graphy|logy)))|r(?:anio(?:logy|metry)|iminology|y(?:o(?:biology|genics|nics)|ptanalysis|stallography))|urvilinear[ ]correlation|y(?:bernetics|to(?:genetics|logy)))|de(?:ixis|mography|nt(?:al[ ](?:anatomy|surgery)|istry))|p(?:hilosophy|olitical[ ]philosophy))\b

(Note that the space separation are generated as [ ] per space.
If you want to change it to a quantified class, just run a
find (?:\[ \])+ and replace with whatever you want.
For example \s+ or [ ]+
)


Here it is Formatted:

 \b 
 (?:
      c
      (?:
           linical [ ] 
           (?: anatomy | psychology )
        |  o
           (?:
                gnitive [ ] 
                (?: neuroscience | psychology | science )
             |  mp
                (?:
                     arative [ ] 
                     (?: anatomy | psychology )
                  |  ound [ ] morphology
                  |  utational [ ] linguistics
                )
             |  rrelation
             |  sm
                (?:
                     etic [ ] dentistry
                  |  o
                     (?: graphy | logy )
                )
           )
        |  r
           (?:
                anio
                (?: logy | metry )
             |  iminology
             |  y
                (?:
                     o
                     (?: biology | genics | nics )
                  |  ptanalysis
                  |  stallography
                )
           )
        |  urvilinear [ ] correlation
        |  y
           (?:
                bernetics
             |  to
                (?: genetics | logy )
           )
      )
   |  de
      (?:
           ixis
        |  mography
        |  nt
           (?:
                al [ ] 
                (?: anatomy | surgery )
             |  istry
           )
      )
   |  p
      (?: hilosophy | olitical [ ] philosophy )
 )
 \b 

Adding 10,000 phrases is very easy and the regex is no bigger than
the amount of bytes in the phrases plus a bit of overhead to interlace
the regex.

A final note. You can reduce the time even further by only generating the
regex on phrases.. that is only words separated by horizontal whitespace.

And, be sure to pre-compile the regex. Only have to do this once.

Upvotes: 1

Related Questions