[Dev] SH2 disassembler
#1
(I suspect I can count the number of people who will be interested in this post on one hand, but since the project is far enough along that it produces semi-useful output, it's probably time to toss it over the wall and see if anyone cares. So, here we go...)
I've written a rudimentary, but automated, SH2 disassembler in Python. It's licensed under the GNU General Public License (Version 3), and can be downloaded from here:
https://github.com/logic/sh2dis/
(You'll need all files from that folder ending in ".py", or just click on "zip" in the left-hand sidebar to download a zip file of the whole project.)
This is not an IDA replacement, at least not yet, although my motivation was not having to point folks who are interested in ECU development toward a $1000 code analysis package that they'll only end up using 1% of.
In fact, as of now it has no user interface at all, simply a "demo application" (dis.py) that, given a ROM image from an Evo VIII or IX (and probably most other 7052- or 7055-based platforms, such as the Hayabusa ECU), tries to perform an automatic disassembly in much the same manner as acamus' onload.idc script. For instructions on using dis.py, run "dis.py --help".
Segment handling is modeled after IDA, and I've tried not to torpedo the possibility of implementing other processors (I'm thinking specifically of H8/500 and HC11, for obvious reasons), but I just haven't had the time to think about that yet. The output doesn't currently include IDA's comment-based cross-references, although that information is tracked and could be added pretty easily. It automatically labels "known" (ie. from the platform docs) vectors and registers, and can follow most branches. Branch handling is done by doing very basic register assignment tracking, and there's a ton of room for improvement here (but it seems to be good enough for "in the wild" ROMs right now).
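To make the register-tracking idea concrete, here is a minimal sketch. It is not the actual sh2dis code: the tuple-based instruction format, the "load_const" pseudo-mnemonic, and the function name resolve_indirect_branches are all made up for illustration. It remembers the last constant loaded into each register, uses that to resolve "jsr @rN" / "jmp @rN" targets, and forgets a register's value as soon as something it can't evaluate writes to it.

```python
def resolve_indirect_branches(instructions):
    """Resolve jsr/jmp @rN targets by tracking constant register loads.

    `instructions` is a list of (address, mnemonic, operands) tuples, a
    hypothetical pre-decoded form. A constant load (for example a
    PC-relative literal fetch) is written as ("load_const", (reg, value)).
    """
    known = {}    # register name -> last known constant value
    targets = {}  # address of the branch -> resolved target address
    for addr, mnemonic, operands in instructions:
        if mnemonic == "load_const":
            reg, value = operands
            known[reg] = value
        elif mnemonic in ("jsr", "jmp"):
            (reg,) = operands
            if reg in known:
                targets[addr] = known[reg]
        elif operands:
            # Assumed convention: the destination is the last operand.
            # Any other write invalidates what we knew about that register.
            known.pop(operands[-1], None)
    return targets


code = [
    (0x9C6A, "load_const", ("r0", 0xEE94)),
    (0x9C6C, "jsr", ("r0",)),
    (0x9C6E, "nop", ()),
]
for branch, target in sorted(resolve_indirect_branches(code).items()):
    print("%08X -> sub_%X" % (branch, target))
```

A real tracker would also have to deal with delay slots, conditional paths, and arithmetic on registers, which is where most of the room for improvement is.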
It requires Python 2.5 or 2.6. Python 3.0 will not work, period, full-stop; there's too much new stuff for this to be a trivial porting exercise right now.
In case it's not completely obvious yet: this is NOT end-user software. The target audience for this is other developers right now, and probably only those with a solid working knowledge of both IDA and Python. Knowing SH2 assembly wouldn't hurt, either.
Performance is not quite where I'd like it to be right now; it takes about 30 seconds on my old dev machine (Dual PIII 1GHz, Linux) to run through a complete disassembly and output, which feels a bit slower than IDA's automated analysis. I'll be very honest, I'm not worrying much about that just yet, since there's so much additional work to be done elsewhere. (If someone feels like tackling the main bottleneck, it's in sh2.py, in disasm_single(); a short-circuiting instruction matching scheme in there, perhaps along with better opcode storage in sh2opcodes.py, would probably cut runtime by more than half.)
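One possible shape for a short-circuiting matcher, as a sketch only: I'm not assuming anything about how disasm_single() or sh2opcodes.py are actually written, and the names PATTERNS, build_buckets, and match_opcode are hypothetical. SH-2 instructions are 16 bits wide, so the patterns can be bucketed by the top nibble once at import time, and each lookup then only tests the handful of patterns in one bucket and returns on the first hit. Only a small subset of opcodes is listed.

```python
# (mask, value, mnemonic) patterns for a small subset of SH-2 opcodes.
PATTERNS = [
    (0xFFFF, 0x0009, "nop"),
    (0xFFFF, 0x000B, "rts"),
    (0xF0FF, 0x400B, "jsr @Rm"),
    (0xF0FF, 0x402B, "jmp @Rm"),
    (0xF000, 0xA000, "bra disp"),
    (0xF000, 0xD000, "mov.l @(disp,pc),Rn"),
    (0xF000, 0xE000, "mov #imm,Rn"),
]


def build_buckets(patterns):
    """Group patterns by the top nibble of the opcode (16 buckets)."""
    buckets = [[] for _ in range(16)]
    for mask, value, name in patterns:
        # Every pattern above fixes the top nibble, so one bucket is enough.
        assert mask & 0xF000 == 0xF000
        buckets[value >> 12].append((mask, value, name))
    return buckets


BUCKETS = build_buckets(PATTERNS)


def match_opcode(word):
    """Return the mnemonic for a 16-bit opcode word, or None if unknown."""
    for mask, value, name in BUCKETS[(word >> 12) & 0xF]:
        if (word & mask) == value:
            return name  # short-circuit: stop at the first match
    return None


print(match_opcode(0x400B))
print(match_opcode(0x0009))
```

Whether this gets anywhere near halving the runtime depends on where the time really goes, so profile before and after.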
It's probably extremely buggy, and the source is certainly a mess as it sits right now. Bug reports and patches are welcome.
Last edited by logic; Apr 10, 2011 at 12:17 PM. Reason: New URL.
#5
Yeah, I can't blame the guy writing IDA (it's really just one fellow, at the core of it) for the pricing; it's such a niche market that it's tough to stay in business if you don't charge appropriately. (Really, who is your target audience? Antivirus/security software developers, pirates, and hobbyists. But only one of those groups has money to spend on software, and another group is going to actively try to redistribute your software.)
And codgi hit the nail on the head; especially for low-volume sales like this, a "non-commercial use" or "student" license doesn't really work, because when you need it, it's probably a single-project need. It'd be like one of those companies selling data recovery software giving away a trial version that does a few recoveries before you have to buy it; they'd never get any sales, because you generally only go looking for software like that when you have a single recovery to do.
But that still means we need something to work with, and I'd rather tell people "here, use this free thing that has a few rough edges" than give them a suggestion that's just going to lead to them hopping on The Pirate Bay. Unfortunately, it has a LOT of rough edges right now. Working on it.
#9
I did mention that it was a PIII 1.0GHz, right? But, it's a handy little machine to ssh over to for stuff like this; keeps work and play separate.
In case anyone would rather check it out directly, the git repository is available at: https://github.com/logic/sh2dis
Last edited by logic; Apr 10, 2011 at 12:19 PM. Reason: Moved to github.
#11
I already have some patches for you. No change in functionality yet, but some improvements to the interface, documentation, that kind of stuff.
Example: you don't want to be sending the output to stdout if you are calling diassemble() from the Profile module.
Do you want them?
d
#12
Patches are always welcome. I've been leaving some of the "fit and finish" stuff off to the side while I've been writing and re-writing the segment API and working out how best to integrate tests for instruction disassembly and register tracking. (It's nice working on something that actually lends itself well to automated testing for a change.)
#13
Well, would you look at that; the output is starting to look familiar now:
Code:
00009C5C init:      mov.l   @(0x14,pc),r15  ! [unk_9C74] = sp
                                            ! XREF: v_power_on_pc
                                            ! XREF: v_reset_pc ...
00009C5E            mov.l   @(0x18,pc),r0   ! [unk_9C78] = unk_FFFFABA0
00009C60            mov.l   @(0x18,pc),r1   ! [unk_9C7C] = unk_FFFFABA0
00009C62            mov.l   r1,@r0
00009C64            mov     #0x0,r0
00009C66            ldc     r0,vbr
00009C68            ldc     r0,gbr
00009C6A            mov.l   @(0x14,pc),r0   ! [unk_9C80] = sub_EE94
00009C6C            jsr     @r0             ! sub_EE94
00009C6E            nop
00009C70            bra     reset
00009C72            nop
! ------------------------------------------------------------
00009C74 unk_9C74:  .long   sp              ! XREF: init
00009C78 unk_9C78:  .long   unk_FFFFABA0    ! XREF: 0x9C5E
00009C7C unk_9C7C:  .long   unk_FFFFABA0    ! XREF: 0x9C60
00009C80 unk_9C80:  .long   sub_EE94        ! XREF: 0x9C6A
! ------------------------------------------------------------
00009C84 reset:     mov.l   @(0x8,pc),r0    ! [unk_9C90] = v_int_trap1C
                                            ! XREF: v_gen_ill_inst
                                            ! XREF: 0x14 ...
00009C86            ldc.l   @r0+,sr
00009C88            mov.l   @(0x8,pc),r0    ! [unk_9C94] = init
00009C8A            jmp     @r0             ! init
00009C8C            nop
This is just some candy (because I'm avoiding working on more important things), but it's neat to see the underlying construction start to show through in the output; this little snippet really demonstrates what it can do now.
I have a whole new level of respect for IDA and Hex-Rays at this point.
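For anyone wondering how the "[unk_9C74]" style annotations in that listing fall out of the instruction itself, here is a small worked example. The function name pc_relative_long_target is made up, and it assumes the displacement is the byte value as printed in the listing (already scaled by 4). For "mov.l @(disp,pc),Rn" the SH-2 uses the instruction's address plus 4 as PC, clears the low two bits, and adds the displacement.

```python
def pc_relative_long_target(insn_addr, disp):
    """Address read by a `mov.l @(disp,pc),Rn` located at insn_addr.

    `disp` is the byte displacement as shown in the listing. PC is taken
    as the instruction address + 4, with the low two bits masked off so
    the literal is always longword-aligned.
    """
    return ((insn_addr + 4) & ~0x3) + disp


# (instruction address, displacement) pairs taken from the listing.
for insn_addr, disp in [(0x9C5C, 0x14), (0x9C5E, 0x18), (0x9C6A, 0x14)]:
    target = pc_relative_long_target(insn_addr, disp)
    print("%08X reads unk_%X" % (insn_addr, target))
```

Note how the instructions at 0x9C5E and 0x9C60 share the displacement 0x18 but land on different literals; that is the alignment mask at work.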
#14
Happiness is progress.
dis.py now takes a "-m" command-line argument, which applies Mitsubishi-specific fixups to the ROM. Right now, that means it automatically locates/disassembles jump tables (as indicated by use of MOVA), and also tries to locate the MUT table (and seems to mostly succeed). Both are borrowed from acamus' onload.idc script; he should get the credit for the way I implemented determination of their locations.
At this point, the results appear to be almost as good as when I load up a virgin ROM in IDA, after onload.idc and sh3.cfg do their magic.
Next up: "empty space" scanning and .ORG directive generation (any block of "FF" larger than, say, five bytes, gets turned into a .ORG directive in the output), and output of any referenced RAM and hardware register addresses as .EQU directives (ie. "reg_PACRL .equ 0xFFFFF724"). Pretty soon, the output might actually be able to be fed back into gas directly.
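The "empty space" scan could look something like the sketch below. This is an illustration, not code from sh2dis: find_fill_runs and the exact ".org" formatting are assumptions, and "larger than five bytes" is read as a minimum run of six.

```python
def find_fill_runs(rom, fill=0xFF, min_len=6):
    """Yield (start, end) offsets of runs of `fill` bytes.

    Only runs at least `min_len` bytes long are reported; `end` is
    exclusive, so it is also the offset where real content resumes.
    """
    data = bytearray(rom)
    start = None
    for offset, byte in enumerate(data):
        if byte == fill:
            if start is None:
                start = offset
        else:
            if start is not None and offset - start >= min_len:
                yield start, offset
            start = None
    # A run that extends to the very end of the image.
    if start is not None and len(data) - start >= min_len:
        yield start, len(data)


rom = b"\x00\x09" + b"\xFF" * 8 + b"\x00\x0B" + b"\xFF" * 3
for start, end in find_fill_runs(rom):
    # Skip the padding and resume output where the content picks up again.
    print(".org 0x%X" % end)
```

The trailing three-byte run in the sample is below the threshold, so it would be emitted as ordinary data rather than skipped.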
Thinking further ahead, I'd like to add some kind of automated table extraction; ie. look for any "02 XX FF FF" and "03 XX FF FF" sequences that have references to them (or try to actually parse out sub_C28, sub_CC6, etc. calls; I'll need to get smarter about saving register data with generated code to pull that off, though), and at least label them as tables (tbl_XXXX, instead of unk_XXXX) or something. It's candy, but potentially useful, especially if I can bolt up the code I already have for parsing EcuFlash XML files, which would give me something to cross-reference auto-located tables against. I'll probably have to add some concept of an "array" of values at that point too, for auto-prettifying the output.
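A rough sketch of the signature scan, with plenty of assumptions: find_table_candidates is a made-up name, ROM offsets are treated as addresses (ROM mapped at 0), every byte offset is checked since I'm not assuming any alignment, and "references" is just a set of addresses collected elsewhere. It flags any "02 XX FF FF" or "03 XX FF FF" sequence and proposes a tbl_XXXX label for it.

```python
def find_table_candidates(rom, referenced=None):
    """Return (offset, label) pairs for possible table headers.

    A candidate is any 4-byte sequence 02 XX FF FF or 03 XX FF FF.
    If `referenced` (a set of addresses) is given, only candidates that
    something actually points at are kept.
    """
    data = bytearray(rom)
    hits = []
    for offset in range(len(data) - 3):
        if (data[offset] in (0x02, 0x03)
                and data[offset + 2] == 0xFF
                and data[offset + 3] == 0xFF):
            if referenced is None or offset in referenced:
                hits.append((offset, "tbl_%X" % offset))
    return hits


rom = b"\x00" * 4 + b"\x02\x10\xFF\xFF" + b"\x00" * 4 + b"\x03\x07\xFF\xFF"
for offset, label in find_table_candidates(rom, referenced={12}):
    print("%08X %s" % (offset, label))
```

Without the reference filter this will produce false positives, since those byte patterns can easily turn up in unrelated data.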