Reputation: 73
I want to write a program that parses yum config files. These files look like this:
[google-chrome]
name=google-chrome - 64-bit
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
This format looks very easy to parse, but I do not want to reinvent the wheel. If there is an existing library that can parse this format generically, I want to use it. But how do you find a library for something you cannot name? The file extension is no help here: searching for ".repo" does not yield any general results besides yum itself.
So, please teach me how to fish: How do I effectively find the name of a file format that is unknown to me?
Upvotes: 1
Views: 686
Reputation: 3477
Identifying an unknown file format can be a pain. But you have some options. I will start with a very obvious one.
Showing other people the format is perhaps the best way to find out its name. Someone will likely recognize it. And if no one does, chances are good that you have a proprietary file format in front of you.
In the case of your yum repository file, I would say it is a plain old INI file. But let's do some more research on this.
Reverse engineering may be your best bet if nobody recognizes your format. Take the reference implementation and find out what it uses to parse the format. Luckily, yum is open source, so this is easy to look up. Let's see what the yum authors use to parse their repo file:
try:
    ini = INIConfig(open(repo.repofile))
except:
    return None
https://github.com/rpm-software-management/yum/blob/master/yum/config.py#L1304
The import of INIConfig can be found here:
from iniparse import INIConfig
https://github.com/rpm-software-management/yum/blob/master/yum/config.py#L32
This leads us to a library called iniparse (https://pypi.org/project/iniparse/). So yum uses an INI parser for its config files.
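If you only need to read such a file from Python, the standard library's configparser module handles simple INI files like this one. The following is a minimal sketch that swaps in configparser instead of the iniparse library yum uses; the sample text is the repo file from the question, and parse_repo_text is a hypothetical helper name chosen for illustration.

```python
import configparser

# Sample taken from the question above
SAMPLE = """\
[google-chrome]
name=google-chrome - 64-bit
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
"""


def parse_repo_text(text):
    """Parse INI-style text into {section: {key: value}} (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


repos = parse_repo_text(SAMPLE)
print(repos["google-chrome"]["baseurl"])
```

All values come back as strings, so a flag like enabled has to be converted by the caller. Real repo files may use yum-specific features (such as variables like $basearch) that a generic INI parser will not expand.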
I will show you how to quickly navigate to these kinds of code passages, since finding your way around a somewhat large project can be intimidating.
I use a tool called ripgrep (https://github.com/BurntSushi/ripgrep).
My initial anchors are usually well-known file paths. In the case of yum, I took /etc/yum.repos.d for my initial search:
# assuming you are in the root directory of yum's source code
rg /etc/yum.repos.d yum
yum/config.py
769: reposdir = ListOption(['/etc/yum/repos.d', '/etc/yum.repos.d'])
yum/__init__.py
556: # (typically /etc/yum/repos.d)
This narrows it down to two files. If you continue with terms like read or parse, you will quickly find the results you want.
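If ripgrep is not at hand, the same anchor search can be approximated in a few lines of Python. This is only a rough sketch, not a replacement for rg: find_anchor is a hypothetical helper, it does plain substring matching rather than regular expressions, and it only looks at files matching the given glob pattern.

```python
from pathlib import Path


def find_anchor(root, needle, pattern="*.py"):
    """Yield (path, line_number, line) for every line containing needle."""
    root = Path(root)
    if not root.is_dir():
        return  # nothing to search
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file, skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()


if __name__ == "__main__":
    # Assumes the current directory is the root of yum's source code
    for path, number, line in find_anchor("yum", "/etc/yum.repos.d"):
        print(f"{path}:{number}: {line}")
```

Once the anchor has narrowed things down, calling it again on the same directory with needles such as "read" or "parse" follows the same idea as the second round of searching described above.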
Sometimes, though, you have no access to the source code of a reference implementation, for example because it is closed source. In that case, try to break the format: insert some garbage and check the log files afterwards. If you are lucky, you will find an error message that gives you hints about the format. If you feel very brave, you can try an actual decompiler as well. This may or may not be illegal and may or may not be a waste of time. I personally would only do this as a last resort.
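To illustrate the "break the format" idea, here is a small sketch that feeds deliberately malformed input to a parser and looks at the error it reports. Python's configparser is used purely as a stand-in for the program under investigation (with a closed-source tool you would edit the file and read its logs instead), and probe is a hypothetical helper name.

```python
import configparser


def probe(text):
    """Feed text to the parser and return the exception it raises, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return exc
    return None


# A key=value line with no [section] line before it
broken = "name=google-chrome - 64-bit\n"
error = probe(broken)
print(type(error).__name__, "-", error)
```

The exception type and message mention section headers, which is exactly the kind of vocabulary that points towards an INI-style format when you search for it.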
Upvotes: 1