Reputation: 31
I'm playing with an app and trying to reverse engineer data files that it can export and import. The files are protobufs in binary. My goal is to be able to export a file, convert it to text, modify it with additional data records, re-encode it to binary, and reimport it as a way to bypass tedious manual input of data into the app. I have used a protoc binary on my Windows machine with --decode_raw and can produce nicely readable hierarchical data without knowing the actual .proto schema used. Using Marc Gravell's parser gives similar results (with some ambiguities I don't quite understand). My questions are the following:
1. Is there a way to reverse --decode_raw to produce the original binary, either using protoc or another tool? I understand that the raw decode is making assumptions about the unknown schema, and so far it looks like those assumptions work well enough to give intelligible results. Is there a loss of data on the raw decode that would prevent re-encoding to the original? Is it just that the protoc developers didn't see a need to have this feature? With this capability, I could modify the text and re-encode, and have a decent chance of generating a valid binary.
2. Can I write a .proto file and a text message input file to re-encode the original binary using protoc --encode? I would appreciate a pointer to sample text files that could be used as command line input to protoc for me to play with to learn the needed syntax. The sample stuff I've seen all appears geared towards using protoc to generate source code. The binary protobufs I tested have decoded to strings, ints and a few hex values (which I still need to decipher) which correspond well to the data visible in the app, so I have confidence that I can make the required schema if I see working examples.
Some preferences: I'm tinkering on my phone and my Windows laptop, and would rather not need to install Python or another programming platform. I'd just like to use protoc on the command line, and my text/hex editor.
Thanks for any help.
[Edit: I've located a web page that gives sample input, which gave me the clues I needed to make some progress. The page is https://medium.com/@at_ishikawa/cli-to-generate-protocol-buffers-c2cfdf633dce, so thanks to @at_ishikawa for taking the time to make it. With the examples, I understand how to format a message file to generate a binary. However, it looks like the binary I'm trying to decode may not be amenable to the command line. See my new question below.]
New question: I still have the goal of decoding the binary into a text message, editing the text message to add more data records, and re-encoding the modified text message to make a new binary that will hopefully be successfully imported by the app. Using --decode_raw, I can see that my binary file has the following format:
1 {
  1: "ThisItem:name1"
  2 {
    1: "name1"
    2: <string>
    4: <string>
    5: 1
  }
}
1 {
  1: "ThisItem:name2"
  2 {
    1: "name2"
    2: <string>
    4: <string>
    5: 1
  }
}
1 {
  1: "ThatItem:name1"
  2 {
    1: "name1"
    3: <string>
    5: <data structure>
    8: <string>
  }
}
1 {
  1: "ThatItem:name2"
  2 {
    1: "name2"
    3: <string>
    5: <data structure>
    8: <string>
  }
}
1 {
  1: "ThisItem:name3"
  2 {
    1: "name3"
    2: <string>
    4: <string>
    5: 1
  }
}
So I see several characteristics of the data structure:
I can then make a .proto file to almost support this structure:
syntax = "proto3";
message RecordList {
  repeated Record records = 1;
}
message Record {
  string id = 1;
  ThisItem item = 2;
  ThatItem item = 2; // Problem here, each record uses field 2, but with different message types.
  // Each record has either a ThisItem or ThatItem. Parsing the id field could tell which,
  // but that doesn't appear possible with protoc on the command line.
}
message ThisItem {
  string id = 1;
  string <element2> = 2;
  string <element4> = 4;
  int32 <element5> = 5;
}
message ThatItem {
  string id = 1;
  string <element3> = 3;
  <message type> <element5> = 5;
  string <element8> = 8;
}
So I'm not sure if there is a way to decode/encode this binary on the command line. Is there some syntax I can use for the Record message to switch between the two possible choices for field 2 by parsing the string in field 1? If not, I will need to read and parse the records in a program, which is what I wanted to avoid.
One other possibility that I've realized: Instead of two different sub-messages ThisItem and ThatItem, I could use one sub-message and skip unused fields. The sub-message would populate fields 1, 2, 4 and 5 in one case, and fields 1, 3, 5 and 8 in the other case. The difficulty is field 5, which is the integer 1 in one case and a data structure in the other case. I'm not sure how to manage that. Is the integer 1 the binary encoding of an empty message?
Thanks for any help.
Upvotes: 2
Views: 2344
Reputation: 3503
I am the maker of rawproto npm lib & tool. I often like to use raw protobuf, even if I have (some or all of) the message SDL, just because it's pretty light to decode, and I don't have to load in a whole (excellent, but big) protobufjs lib. I also like that it's kinda random-access, so I can just pull out only the values I need, and ignore the rest, mostly. For encoding, I don't think there is an existing general solution, but if the message is pretty simple, it's not too bad to encode it by hand. Definitely check out this article on message-format. Super-helpful, if you want to go this route. Maybe it's something I could add to the js lib, in a general way, like give it a field-definition like this:
{
"id": "1.2.4.1:string",
"title": "1.2.4.5:string",
"company": "1.2.4.6:string",
"description": "1.2.4.7:string",
"media": "1.2.4.10",
"dimensions": "1.2.4.10.2",
"width": "1.2.4.10.2.3:uint",
"height": "1.2.4.10.2.4:uint",
"url": "1.2.4.10.5:string",
"type": "1.2.4.10.1:uint",
"bg": "1.2.4.10.15:string"
}
then
encode(fieldDef, { 'id': 'whatever', title: 'Some Title' })
You can see that it ends up about as complicated as just making a proto-definition file, though, so maybe there is not really a point (since you can use all the existing/official tooling with that):
message Parent {
  Child1 a = 1;
}
message Child1 {
  Child2 b = 2;
}
message Child2 {
  Target c = 4;
}
message Target {
  optional string id = 1;
  optional string title = 5;
  // ...etc
}
// const msg = {a: {b: {c: { title: 'Cool' }}}}
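For the hand-encoding route mentioned above, here is a minimal Python sketch of the wire format, using only the standard library. The helper names (encode_varint, tag, varint_field, bytes_field, string_field) and the string values "abc" and "xyz" are made up for illustration; the record shape follows the --decode_raw dump in the question. It is a sketch of the documented encoding rules, not the rawproto library's API.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def varint_field(field_number: int, value: int) -> bytes:
    """Wire type 0: non-negative integers, bools, enums."""
    return tag(field_number, 0) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    """Wire type 2: strings, bytes and embedded messages all look like this."""
    return tag(field_number, 2) + encode_varint(len(payload)) + payload


def string_field(field_number: int, text: str) -> bytes:
    return bytes_field(field_number, text.encode("utf-8"))


# One record shaped like the question's dump (string values are invented).
item = (
    string_field(1, "name1")
    + string_field(2, "abc")
    + string_field(4, "xyz")
    + varint_field(5, 1)
)
record = string_field(1, "ThisItem:name1") + bytes_field(2, item)
record_list = bytes_field(1, record)

print(record_list.hex(" "))
print(varint_field(5, 1).hex(" "))   # integer 1 in field 5      -> 28 01
print(bytes_field(5, b"").hex(" "))  # empty message in field 5  -> 2a 00
```

The last two lines bear on the question about field 5: an integer 1 and an empty message are different on the wire, because the wire type is part of the key byte (0x28 versus 0x2a).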
Upvotes: 0
Reputation: 31
Since it looked like protoc on the command line couldn't do what I wanted, I turned to writing a program. The easiest path for me was to install python, since the learning curve didn't look too steep and I could build a script bit by bit. The key for the data structure turned out to be replacing this part of my hypothetical .proto file:
message Record {
  string id = 1;
  ThisItem item = 2;
  ThatItem item = 2; // Problem here, each record uses field 2, but with different message types.
  // Each record has either a ThisItem or ThatItem. Parsing the id field could tell which,
  // but that doesn't appear possible with protoc on the command line.
}
with a generalized form:
message Record {
  optional string id = 1;
  oneof datafields {
    bytes data = 2;
    ThisItem thisitem = 3;
    ThatItem thatitem = 4;
  }
}
The protobuf binary only uses the general bytes data structure, which is why protoc with --decode_raw shows all the data using the field number of 2. The data field can then be a container for ThisItem or ThatItem as necessary. Those two structures are also included as possible datafields so that the program record structure can accommodate them for programmatic manipulation.
Here is sample code for Python, where the .proto file is myschema.proto, defined as shown in my question above with the generalized Record message:
import myschema_pb2  # module generated by protoc from myschema.proto
from google.protobuf import text_format

### Read objects from PB and load into RecordList
mylist = myschema_pb2.RecordList()
with open('objects.pb', 'rb') as f:
    mylist.ParseFromString(f.read())

### Parse general data into ThisItem or ThatItem
for rec in mylist.records:
    item_id = rec.id.partition(':')[0]  # text before the first ':'
    if item_id == 'ThisItem':
        rec.thisitem.ParseFromString(rec.data)  # parses data into thisitem and clears data
    elif item_id == 'ThatItem':
        rec.thatitem.ParseFromString(rec.data)  # parses data into thatitem and clears data
    else:
        print('unknown')
Thisitem and thatitem can then be manipulated as needed. When it's time to write the protobuf file they are converted back into the general data format:
### Generalize ThisItem and ThatItem into data
for rec in newlist.records:
    item_id = rec.id.partition(':')[0]  # text before the first ':'
    if item_id == 'ThisItem':
        rec.data = rec.thisitem.SerializeToString()  # setting data clears thisitem (same oneof)
    elif item_id == 'ThatItem':
        rec.data = rec.thatitem.SerializeToString()  # setting data clears thatitem (same oneof)
    else:
        print('unknown')
Note again, this structure is just peculiar to the protobuffer I've been working with. I'm not sure why the developer decided to do it like this, rather than write thisitem and thatitem to the binary. As far as I know, all it changes is the field number, 2, 3 or 4.
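To illustrate the point about the field number, here is a small standard-library sketch. The helper length_delimited and the sample payload are hypothetical, not part of the generated myschema_pb2 module. It wraps the same serialized item as field 2, 3 and 4 and shows that only the leading key byte differs.

```python
def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode a wire-type-2 field.

    Simplification for this sketch: the key and the length must each fit
    in one byte, so field_number < 16 and len(payload) < 128.
    """
    if not (0 < field_number < 16 and len(payload) < 128):
        raise ValueError("sketch only handles one-byte keys and lengths")
    key = (field_number << 3) | 2
    return bytes([key, len(payload)]) + payload


# A serialized item containing only field 1 = "name1" (made-up sample).
payload = b"\x0a\x05name1"

as_data = length_delimited(2, payload)      # bytes data = 2
as_thisitem = length_delimited(3, payload)  # ThisItem thisitem = 3
as_thatitem = length_delimited(4, payload)  # ThatItem thatitem = 4

for label, blob in [("data=2", as_data),
                    ("thisitem=3", as_thisitem),
                    ("thatitem=4", as_thatitem)]:
    print(f"{label:11} {blob.hex(' ')}")
```

The key bytes come out as 12, 1a and 22; everything after the key is identical, since a bytes field and an embedded message share the same length-delimited wire type.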
Upvotes: 1
Reputation: 161
Let me see if I can help answer this question.
protoc --decode_raw is a dead end: you cannot use protoc to re-encode its output, because there is no such thing as --encode_raw. You cannot have a proto file with messages named 1, 2, 3, etc.; it does not work. However, if you can set up a schema to match the data, you can feed it to protoc to encode or decode easily.
This is the text used to encode a message. I have saved it into a file named message for my example below.
records{
  id: '1'
  item{
    id: '1.1'
    element2: 'e2'
    element4: 'e4'
    element5: 5
  }
}
records{
  id: '2'
  item{
    id: '2.1'
    element2: 'e2'
    element4: 'e4'
    element5: 5
  }
}
I named this file test.proto in example below.
syntax = "proto3";
message RecordList {
  repeated Record records = 1;
}
message Record {
  string id = 1;
  ThisItem item = 2;
}
message ThisItem {
  string id = 1;
  string element2 = 2;
  string element4 = 4;
  int32 element5 = 5;
}
# Encode message and then decode it with our schema
protoc --encode="RecordList" --proto_path= ./test.proto < message | protoc --decode="RecordList" --proto_path= ./test.proto
Output:
records {
  id: "1"
  item {
    id: "1.1"
    element2: "e2"
    element4: "e4"
    element5: 5
  }
}
records {
  id: "2"
  item {
    id: "2.1"
    element2: "e2"
    element4: "e4"
    element5: 5
  }
}
At this point you could modify this output to have different values that you want and encode it. Using a hex dump tool like hd or xxd can be very useful as well!
# Send the output from protoc decoding to be encoded again with different values.
# Let's say you saved the output to out.txt
protoc --encode="RecordList" --proto_path= ./test.proto < out.txt
# You can always decode your output to see the formatted parsed version
protoc --encode="RecordList" --proto_path= ./test.proto < out.txt | protoc --decode="RecordList" --proto_path= ./test.proto
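On Windows, where hd or xxd may not be installed, a few lines of Python can stand in for the hex dump. This is a minimal sketch: hex_string is a made-up helper name, the sample bytes are arbitrary, and bytes.hex with a separator needs Python 3.8 or later.

```python
from pathlib import Path


def hex_string(data: bytes, sep: str = "-") -> str:
    """Render bytes as upper-case hex pairs joined by sep."""
    return data.hex(sep).upper()


sample = bytes([0x08, 0x24, 0xAF, 0x22])
print(hex_string(sample))  # 08-24-AF-22

# For a real export, read the file first (the file name is only an example):
# print(hex_string(Path("objects.pb").read_bytes()))
```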
I would like to see the raw output of your --decode_raw command; even seeing the payload as a hex string like 08-24-AF-22 would be helpful.
I can't really suggest anything on the variance of the message types without seeing that.
For example, I know that negative numbers of type int32 will show up as large unsigned integers with --decode_raw. I am not sure what may be going on in your case without seeing the raw data more closely.
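The negative-number effect can be reproduced in a few lines of standard-library Python. This is a sketch of the documented varint rules, not protoc's own code, and the function names are made up: a negative int32 is sign-extended to 64 bits before being written as a varint, so a decoder without a schema can only report the unsigned value.

```python
def encode_int32(value: int) -> bytes:
    """Encode an int32 as protobuf does: sign-extend to 64 bits, then varint."""
    value &= 0xFFFFFFFFFFFFFFFF  # two's complement view of the number
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode one varint as an unsigned integer, as a schema-less decoder would."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            break
    return result


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


raw = encode_int32(-1)
unsigned = decode_varint(raw)
print(len(raw), unsigned, as_signed64(unsigned))
```

Here -1 takes ten bytes on the wire and decodes to 18446744073709551615 when read as unsigned; reinterpreting that value as signed gives -1 again.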
But yes, I would recommend trying to combine the messages ThisItem and ThatItem. Not including every message field in every message is very common.
If you have not found this online decoder by Marc Gravell then be sure to try it. It is essentially protoc --decode_raw with more detail.
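To give a feel for what a schema-less decoder has to work with, here is a minimal top-level field scanner in standard-library Python. It is only a sketch under stated assumptions: read_varint and scan_fields are hypothetical names, it is not how protoc or the online decoder is implemented, it does not recurse into nested messages, and it does not handle the deprecated group wire types.

```python
from typing import Iterator, Tuple, Union


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_fields(buf: bytes) -> Iterator[Tuple[int, int, Union[int, bytes]]]:
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes or message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Made-up sample: field 1 = "name1" (length-delimited), field 5 = 1 (varint).
sample = bytes.fromhex("0a 05 6e 61 6d 65 31 28 01")
for number, wire_type, value in scan_fields(sample):
    print(number, wire_type, value)
```

A wire-type-2 payload carries no marker saying whether it is a string, raw bytes or an embedded message, which is why raw decoders have to guess and can present the same data in more than one way.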
Upvotes: 2