Reputation: 1791
There are lots of posts here about parsing strings, but none actually seems to fit my purpose.
I'm using std::string and the rest of the C++ standard library, and I have a text file using the following protocol:
TEXT1:TEXT2-TAB-TEXT3:TEXT4 TEXT5
where -TAB- is \t.
I want to get all the text into strings (could be an array too). All of the lines in the file are written this way. I tried using istringstream, but it has no functionality such as: iss >> text1 >> ":" >> text2 >> "\t" >> text3 >> ":" >> text4 >> " " >> text5.
Do I actually need to parse using basic functions such as find? That would be a ton of work (I have a couple of files written in different formats, so I'll need a general function for all of them), though I would do it if I have no choice.
So... is there any way to parse strings this way, using the known characters between the fields? There is no single delimiter, because each line contains several different ones (sometimes a space, sometimes a colon, and so on). I want to use the C++ standard library and not an external library such as Boost.
EDIT: C++11.
Upvotes: 0
Views: 312
Reputation: 490018
Since you have a single, fixed character to mark the end of each field, anything like regexes borders on overkill. I'd just use std::getline to read each field.
I'd start by defining a struct for the fields in one line, and overloading operator>> to read one of those structs:
#include <istream>
#include <string>

struct line {
    std::string text1, text2, text3, text4, text5;

    // Each getline call stops at (and discards) the delimiter that ends its field.
    friend std::istream &operator>>(std::istream &is, line &l) {
        std::getline(is, l.text1, ':');
        std::getline(is, l.text2, '\t');
        std::getline(is, l.text3, ':');
        std::getline(is, l.text4, ' ');
        std::getline(is, l.text5); // the rest of the line
        return is;
    }
};
With that, you can read a line like:
line x;
std::cin >> x;
...or, if you have an entire file full of lines like that, you can read them all into a vector, something like:
// needs <fstream>, <iterator> and <vector>
std::ifstream infile("whatever.dat");
std::vector<line> lines {
    std::istream_iterator<line>(infile),
    std::istream_iterator<line>()
};
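For anyone who wants to try this without a data file, here is a self-contained sketch of the same idea. The names record and parse_all are placeholders introduced only for this illustration, and the code assumes every line follows the format exactly (no field contains its own terminating delimiter).

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// One line of the form TEXT1:TEXT2<TAB>TEXT3:TEXT4 TEXT5.
// "record" is a placeholder name for this sketch.
struct record {
    std::string text1, text2, text3, text4, text5;

    friend std::istream &operator>>(std::istream &is, record &r) {
        std::getline(is, r.text1, ':');
        std::getline(is, r.text2, '\t');
        std::getline(is, r.text3, ':');
        std::getline(is, r.text4, ' ');
        std::getline(is, r.text5); // the rest of the line
        return is;
    }
};

// Hypothetical helper: read every record from any input stream
// (a std::ifstream, std::cin or a std::istringstream).
std::vector<record> parse_all(std::istream &in) {
    return std::vector<record>(std::istream_iterator<record>(in),
                               std::istream_iterator<record>());
}
```

Because parse_all takes a std::istream&, it can be exercised with a std::istringstream holding a couple of sample lines before pointing it at a real file. Reading stops at the first extraction that fails, which is normally the end of the input.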
Upvotes: 3
Reputation: 1123
Since you are using C++11 and your lines follow a fixed protocol, a natural tool for pattern matching and information extraction is the regex library.
The pattern to match your protocol may look something like this...
(\w+):(\w+)\t(\w+):(\w+)\s(\w+)
... using the default ECMAScript syntax (a few other grammars are available). Each pair of parentheses is a capture group for one field, and \w+ assumes the fields contain only word characters.
Next, use a raw string literal to initialize a regex object...
regex pat{R"((\w+):(\w+)\t(\w+):(\w+)\s(\w+))"};
So now your code can look like this...
#include <iostream>
#include <regex>
#include <string>
using namespace std;
...
regex pat{R"((\w+):(\w+)\t(\w+):(\w+)\s(\w+))"};
smatch m;
string str;
while (getline(cin, str)) { // str is one line of formatted text
    if (regex_search(str, m, pat)) {
        for (size_t i = 1; i < m.size(); i++) { // m[0] is the whole match
            cout << m[i].str() << " "; // to make sure each component was matched
        }
    }
}
By the way, smatch works like a container and can be iterated so it's very convenient.
Note: the above code is not guaranteed to work; it is meant as a guide.
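As a more complete sketch of the regex route, the function below wraps the matching in a helper that returns the five fields. The name parse_line is made up for this example, and the pattern assumes that -TAB- in the question stands for a single tab character, that the last separator is a single space, and that every field consists of word characters only.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the five fields of one line,
// or an empty vector if the line does not match the format.
std::vector<std::string> parse_line(const std::string &line) {
    // One capture group per field; \t inside the raw string is the regex escape for a tab.
    static const std::regex pat{R"((\w+):(\w+)\t(\w+):(\w+) (\w+))"};
    std::smatch m;
    std::vector<std::string> fields;
    if (std::regex_match(line, m, pat)) {
        // m[0] is the whole match, so the fields start at index 1.
        for (std::size_t i = 1; i < m.size(); ++i)
            fields.push_back(m[i].str());
    }
    return fields;
}
```

regex_match is used here instead of regex_search so that a line with extra text before or after the five fields is rejected rather than partially matched. If the fields can contain other characters, \w+ would have to be replaced with something like [^:]+ for the fields that end at a colon.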
Upvotes: 5
Reputation: 1
You should probably read an entire line using std::getline, then parse that line, e.g. by locating the '\t' character with the find or find_first_of member function of std::string.
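Since the question mentions several files in different formats, one way to keep the find-based approach from becoming a ton of work is to pass the delimiters in as data. The following is only a sketch: split_in_order is a hypothetical name, and it assumes each delimiter is a single character that appears in the line in the given order.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `line` at each character of `delims`, in order.
// For the format in the question, delims would be ":\t: ".
// Returns an empty vector if one of the delimiters cannot be found.
std::vector<std::string> split_in_order(const std::string &line,
                                        const std::string &delims) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (char d : delims) {
        std::string::size_type pos = line.find(d, start);
        if (pos == std::string::npos)
            return {}; // the line does not follow the expected format
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1; // skip the delimiter itself
    }
    fields.push_back(line.substr(start)); // everything after the last delimiter
    return fields;
}
```

Each file format then needs only its own delimiter string, for example split_in_order(line, ":\t: ") for the lines shown in the question. Anything after the last delimiter, including further spaces, ends up in the final field.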
If possible, switch to C++11 at least, since many of its features let you write less code. In particular, std::find_if from <algorithm> is helpful when used with a lambda.
Of course, you should define the acceptable input more formally (perhaps with some EBNF notation, at least in comments). In particular, what exact characters can appear in TEXT1, TEXT2, TEXT3, TEXT4 and TEXT5? In what encoding? (UTF-8 has multibyte characters!)
If the input specification is complex, you might consider using a parser generator such as ANTLR.
Upvotes: 0