user1356386
user1356386

Reputation:

Fastest way to capitalize words

What is the fastest way to capitalize words (std::string) using C++?

On Debian Linux using g++ 4.6.3 with the -O3 flag, this function using boost::to_lower will capitalize 81,450,625 words in roughly 24 seconds in a single thread of execution on a AMD Phenom(tm) II X6 1090T Processor (3200 MHz).

void Capitalize( std::string& word )
{
    boost::to_lower( word );
    word[0] = toupper( word[0] );
}

This function using std::transform does the same thing in roughly 10 seconds. I clear the VM between testing, so I don't think this difference is a fluke:

sync && echo 3 > /proc/sys/vm/drop_caches

void Capitalize( std::string& word )
{
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    word[0] = toupper( word[0] );
}

Are there faster ways? I would not want to lose portability for the sake of speed, but if there are faster ways to do this that work in std C++ or std C++ with boost, I'd like to try them.

Thanks.

Upvotes: 8

Views: 5201

Answers (6)

Adversus
Adversus

Reputation: 2226

Had this exact question when dealing with DNA sequences where the input is not guaranteed to be upper case, and where boost::to_upper was a bottle-neck in the code. Changing to this:

template<typename T_it>
void SequenceToUpperCase( T_it begin, T_it end )
{
    // Convert to upper: clear the '32' bit, 0x20 in hex. And with the
    // inverted bit string (~).
    for ( auto it = begin; it != end; ++it )
        *it &= ~0x20;
}

resulted in a huge speed increase. I'm sure it is possible to further optimize by e.g. flipping 8 bytes at once but with the above code the upper-case is near-instantaneous for us. For lower-case: do:

        *it |= 0x20;

Upvotes: 7

user45389
user45389

Reputation: 81

I would use a for loop to go through the entire string, character by character, parsing them to a function to convert to uppercase. To hopefully speed things up, I'll write the capitalisation function in assembly. The C++ component of the application would be like so:

#include <iostream>
#include <string>

extern "C" char asm_capt(char x);

using namespace std;

int main(void)
{
    string str;
    cin >> str;
    string tmp = str;

    for(int i=0; i<str.length(); i++){

        tmp.at(i) = asm_capt(str.at(i));
    }

}

Now comes the assembly part; I'll assume you are using Windows and the MASM compiler. The code should be saved within a .asm source file, and included in the project with MASM build settings:

.model flat
.code
_asm_capt proc

    mov rax, rcx
    cmp rax, 61h
    jl already_capt

    sub rax, 20h
    ret

    already_capt:

    ret

_asm_capt endp
end

Essentially, it checks if the character (in hexadecimal) is less than 0x61, which would imply, if you only use letters, that it is already capitalised. otherwise, the value is decreased by 0x20, which moves it to the lower case equivalent. (See ASCII table.)


Note: By default, the return parameter is stored in the RAX register; the first parameter passed by C++ to assembly is stored in RCX.

Upvotes: 0

Anand Rathi
Anand Rathi

Reputation: 786

I have an implementation I found it faster than std::transform , Compiled in g++ -03 Fedora 18.

performance time in seconds :
transform took         : 11 s
my implementation took : 2 s
Test data size = 26*15*9999999 chars
inline void tolowerPtr(char *p) ;

inline void tolowerStr(std::string& s)
{char* c=const_cast<char*>(s.c_str());
size_t l = s.size();
  for(char* c2=c;c2<c+l;c2++)tolowerPtr(c2); 
};

inline void tolowerPtr(char *p) 
{
switch(*p)
{
  case 'A':*p='a'; return;
  case 'B':*p='b'; return;
  case 'C':*p='c'; return;
  case 'D':*p='d'; return;
  case 'E':*p='e'; return;
  case 'F':*p='f'; return;
  case 'G':*p='g'; return;
  case 'H':*p='h'; return;
  case 'I':*p='i'; return;
  case 'J':*p='j'; return;
  case 'K':*p='k'; return;
  case 'L':*p='l'; return;
  case 'M':*p='m'; return;
  case 'N':*p='n'; return;
  case 'O':*p='o'; return;
  case 'P':*p='p'; return;
  case 'Q':*p='q'; return;
  case 'R':*p='r'; return;
  case 'S':*p='s'; return;
  case 'T':*p='t'; return;
  case 'U':*p='u'; return;
  case 'V':*p='v'; return;
  case 'W':*p='w'; return;
  case 'X':*p='x'; return;
  case 'Y':*p='y'; return;
  case 'Z':*p='z'; return;
};
return ;
}

void testtransform( std::string& word )
{
std::string word2=word; 
time_t t;
time_t t2;
time(&t);
std::cout << "testtransform: start " << "\n";
int i=0;
for(;i<9999999;i++) 
{    word2=word;
    std::transform(word2.begin(), word2.end(), word2.begin(), ::tolower);
}
time(&t2);
std::cout << word2 << "\n";
std::cout << "testtransform: end " << i << ":"<< t2-t << "\n";
}

void testmytolower( std::string& word )
{
std::string word2=word; 
time_t t;
time_t t2;
time(&t);
std::cout << "testmytolower: start " << "\n";
int i=0;
for(;i<9999999;i++)
{   word2=word;
    cstralgo::tolowerStr(word2);
}
time(&t2);
std::cout << word2 << "\n";
std::cout << "testmytolower: end " << i << ":"<< t2-t << "\n";
}

int main(int argc, char* argv[])
{
   std::string word ="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
   word =word+word+word+word+word+word+word+word+word+word+word+word+word+word+word;
   testtransform( word);
   testmytolower( word);
   return 0;
}

I will be glad to know if performance can be improved further.

Upvotes: 1

commanderdileep
commanderdileep

Reputation: 15

Required pseudocode would be:

for(int i=0 to given_str.length)
{ upper[i]=(char*)given_str[i]-32; // in ascii tbl, any lower case char - upper case char=32
}
return upper;

Upvotes: -1

Daniel
Daniel

Reputation: 31579

Few ways to make it faster:
1. Don't use to_lower, it's slow. Don't use ifs either, Make a table with 256 entries that maps from character to its lowercase version and another table for uppercase.
2. Don't use transform, take a pointer to the first character and loop until the null terminator.
3. If memory is not a problem, use a table that maps 2 character sequences. In this case you will need another table that handles termination.
4. If you can do this in assembly, it will be much faster.

Upvotes: 5

inkooboo
inkooboo

Reputation: 2944

If capitalizing is a really bottleneck then write your own implementation of capitalization with hand-written cycle and inline toupper/tolower functions. Use ASM if it is necessary.

Upvotes: 2

Related Questions