Archive for the ‘dbx’ Category

Debugger Design

Friday, March 20th, 2009

I’ve spent a number of years in the dbx group at Sun, and over time I’ve collected a lot of coulda-woulda-shoulda stories.  You know what I mean: “This code should really have been designed to do XYZ.”  Or “This module shouldn’t have to talk to that module.”  I figured I’d try to record some of the interesting bits for posterity, so I wrote an essay that I vaingloriously call a whitepaper.  So without further ado:

Goodbye Solaris 9 (for Sun Studio)

Wednesday, August 20th, 2008

We’re making the internal transition to building Sun Studio on Solaris 10 (instead of Solaris 9). This is a big deal because the product bits immediately become useless on any Solaris 9 system. There’s a new libm.so.2 library that became available on Solaris 10, and if you depend on it, you can’t run on Solaris 9. It’s a challenge making sure our vast ocean of loosely maintained lab machines is ready for the change. The good news is we get to use newer, faster hardware. 🙂

I’ll make this post short because I’m using ScribeFire for the first time in forever, and I don’t trust it. I can’t believe blogging is still this hard. 🙁

Using dbx and libumem to find memory problems

Friday, May 18th, 2007

Update: There is a new version of umem.dbx with Solaris 9 support. Download it here.

I implemented a spiffy little dbx module to give basic access to libumem debugging features a while back, but I haven’t gotten much feedback on it.  Think of this blog entry like a Dunk Tank at the carnival. Throw a blog comment at me (like a bug or an RFE) and make me add some new features to my libumem dbx module.

Update March 9, 2010: It looks like Eric dunked me good.  (See the comments).  This module may or may not work for you. But it’s still probably useful as an example of how to write advanced dbx scripts.

Here are things I know to be true:

  • developers prefer dbx to mdb (unless you’re hacking on the kernel)
  • the RTC feature in Sun Studio is slow and sometimes buggy
  • memory allocation and access bugs are NASTY to track down
  • nobody has requested any new features for the libumem module

So the only conclusion I can draw is that not enough people know about this module yet.  🙂  With that in mind, here is the command line version of the ever-popular screenshot.

(dbx) source umem.dbx
(dbx) alias u=umem

(dbx) u start
Enabling libumem debugging

(dbx) cc -g t.c
(dbx) debug a.out
(dbx) list 1,$
    1   #include <malloc.h>
    2   int main()
    3   {
    4        char * p;
    5        p = malloc(1);
    6        p = malloc(1);
    7        p = malloc(1);
    8        free(p);
    9        p++;
   10        // this free will cause an error in libumem
   11        // if checking is on, because it's a bad free
   12        free(p);
   13   }
   14

(dbx) run
signal ABRT (Abort) in __lwp_kill at 0xff2bd5ec
0xff2bd5ec: __lwp_kill+0x0008:  bcc,a,pt  %icc,__lwp_kill+0x18  ! 0xff2bd5fc
Current function is main
   12        free(p);

(dbx) print p
p = 0x5bfa9 "\xad\xbe\xef\xde\xad\xbe\xef\xfe\xed\xfa\xce\xfe\xed\xfa\xce"

(dbx) u findblock 0x5bfa9

Building umem_syms helper library.
Address 0x5bfa9 is inside the umem block at 0x5bfa0.
   This corresponds to the malloc block at 0x5bfa8.

# So we can see that the pointer we tried to free points into
# the middle of a block.

# Let's ask for a history of the umem block that caused umem to barf.

(dbx) u bhist 0x5bfa0

=================================================================
       Log Rec Addr          Block Addr   Thrd         Timestamp
       ------------          ----------   ----         ---------
            0x320c8             0x5bfa0    1     0xbda1d488b8260
0x107d0 : in `a.out`_start   /* No debugging info */
0x10c1c : in `a.out`t.c`main at "t.c":7
0xff36aeb4 : in `libumem.so.1`malloc   /* No debugging info */
0xff36e2d0 : in `libumem.so.1`_umem_alloc   /* No debugging info */
0xff36de8c : in `libumem.so.1`_umem_cache_alloc   /* No debugging info */
=================================================================
       Log Rec Addr          Block Addr   Thrd         Timestamp
       ------------          ----------   ----         ---------
            0x3212c             0x5bfa0    1     0xbda1d488ba010
0x107d0 : in `a.out`_start   /* No debugging info */
0x107d0 : in `a.out`_start   /* No debugging info */
0x10c2c : in `a.out`t.c`main at "t.c":8
0xff36b214 : in `libumem.so.1`malloc.c`process_free   /* No debugging info */
0xff36dfec : in `libumem.so.1`_umem_cache_free   /* No debugging info */
=================================================================

# Don't ask me why _start shows up twice in libumem stack capture.
# It's probably a stray tail-call optimization someplace.

# If you want to see the recent history of allocations/frees, do this:

(dbx) u log
       Log Rec Addr          Block Addr   Thrd         Timestamp
       ------------          ----------   ----         ---------
            0x32000             0x5bfe0    1     0xbda1d488b15c8
            0x32064             0x5bfc0    1     0xbda1d488b6fa0
            0x320c8             0x5bfa0    1     0xbda1d488b8260
            0x3212c             0x5bfa0    1     0xbda1d488ba010
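
A side note on the bytes that print p showed above: they are the fill patterns libumem writes when debugging is on. 0xdeadbeef marks freed memory, 0xfeedface is the redzone guard word, and 0xbaddcafe marks memory that was allocated but never written. Here is a minimal C sketch for decoding such a dump by hand. The names umem_classify and be32 are made up for this example; they are not part of umem.dbx or libumem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit halves of the fill patterns libumem uses in debug mode. */
#define PAT_FREE    0xdeadbeefU   /* buffer has been freed */
#define PAT_UNINIT  0xbaddcafeU   /* allocated but never written */
#define PAT_REDZONE 0xfeedfaceU   /* guard word after the buffer */

enum umem_word { WORD_OTHER, WORD_FREED, WORD_UNINIT, WORD_REDZONE };

/* Hypothetical helper: say which fill pattern, if any, a word holds. */
static enum umem_word umem_classify(uint32_t word)
{
    switch (word) {
    case PAT_FREE:    return WORD_FREED;
    case PAT_UNINIT:  return WORD_UNINIT;
    case PAT_REDZONE: return WORD_REDZONE;
    default:          return WORD_OTHER;
    }
}

/* Assemble a big-endian word from raw bytes, in the order dbx printed
 * them on SPARC. */
static uint32_t be32(const unsigned char *b)
{
    assert(b != NULL);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

In the transcript p is 0x5bfa9, so the first aligned word in the dump starts three bytes in, at 0x5bfac. That word reads de ad be ef (freed), and the word after it reads fe ed fa ce (redzone).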

Latest Sun dwarf extensions

Friday, May 11th, 2007

I’ve been working with the Sun lawyers and the Dwarf Standards Committee recently to change the overly zealous license on the Sun Dwarf Extensions document.  I think we’ve finally gotten it down to something reasonable.  Anyway, we’ve added a few twists for C++ and Fortran 90.  As an example, there are some new structures for identifying code segments that implement C++ destructors, and correlating them to the object being destroyed. So with that introduction, here is the latest document.

dbx .ldynsym support – stack traces for stripped programs

Thursday, May 3rd, 2007

Stack traces for stripped programs should get easier to read on Solaris.  Solaris Nevada added a new strip-proof symbol table that retains some of the symbols that normally get removed by the strip command, basically the static functions. Static functions have always been the number one cause of unreadable stack traces in stripped programs, but now the Solaris utilities (dtrace, mdb, pstack, etc.) and also dbx (in Sun Studio 12 FCS) will be able to make use of these new symbols. The new Solaris feature is described here.

Before:

% cc t.c && strip a.out && ./a.out ; dbx -c 'where;quit' - core
"t.c", line 2: warning: implicit function declaration: abort
Abort (core dumped)
Corefile specified executable: "/home/quenelle/a.out"
Reading a.out
core file header read successfully
Reading ld.so.1
Reading libc.so.1
program terminated by signal ABRT (Abort)
0xbff80717: __lwp_kill+0x0007:  jae      __lwp_kill+0x15        [ 0xbff80725, .+0xe ]
=>[1] __lwp_kill(0x1, 0x6), at 0xbff80717
[2] _thr_kill(0x1, 0x6), at 0xbff7ded4
[3] raise(0x6), at 0xbff2ced3
[4] abort(0x8047408, 0x80506db, 0x8047514, 0x8047428, 0x805062d, 0x1), at 0xbff10969
[5] 0x80506c8(0x8047514, 0x8047428, 0x805062d, 0x1, 0x8047434, 0x804743c), at 0x80506c8
[6] main(0x1, 0x8047434, 0x804743c, 0x80505cf), at 0x80506db

After:


% cc t.c && strip a.out && ./a.out ; dbx -c 'where;quit' - core
"t.c", line 2: warning: implicit function declaration: abort
Abort (core dumped)
Corefile specified executable: "/home/quenelle/./a.out"
Reading a.out
core file header read successfully
Reading ld.so.1
Reading libc.so.1
program terminated by signal ABRT (Abort)
0xff344a24: __lwp_kill+0x0008:  bcc,a,pt  %icc,__lwp_kill+0x18  ! 0xff344a34
=>[1] __lwp_kill(0x0, 0xffffffff, 0x0, 0x0, 0xfffffffc, 0x0), at 0xff344a24
[2] raise(0x6, 0x0, 0x5, 0x6, 0xffffffff, 0x6), at 0xff2f7504
[3] abort(0xff386a80, 0x1, 0x6, 0xff3836c0, 0xacf34, 0x0), at 0xff2d3824
[4] baz(0x0, 0x1000, 0xff385ac0, 0xff3a2000, 0x0, 0x4), at 0x10e6c
[5] main(0x1, 0xffbff3e4, 0xffbff3ec, 0x21000, 0xac71c, 0xff3a0140), at 0x10e9c

Stopping right before your crash

Thursday, March 15th, 2007

I just got this question from Steve down the hall.  The answer is so mind-bogglingly useful that everyone who uses dbx needs to know how to do it.  The scenario goes like this. Your program crashes in strcpy sometime in the middle of the run, after about a zillion calls to strcpy.  But you want to stop on entry to the strcpy call which is going to cause the crash.  Here’s how to do it.

(dbx) stop in strcpy -count infinity
(dbx) run
[[ crash ]]
(dbx) status
 (2) stop in strcpy -count 2435/infinity 
(dbx) delete all
(dbx) stop in strcpy -count 2435
(dbx) run
(dbx) where
[[ stopped right before the crash ]]

Of course, you can do the same thing in gdb, but I’ll leave that as an exercise for the reader. 🙂
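
If the two-pass trick looks like magic, here is a small C model of what the transcript above is doing. It is only a sketch under my own assumptions: the struct, the function names and COUNT_INFINITY are invented for illustration, and the rule "stop when the hit count equals the limit" is my reading of the -count behavior shown above, not dbx source code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define COUNT_INFINITY ULONG_MAX   /* stand-in for dbx's "infinity" */

/* Model of a counted breakpoint: it fires on hit number `limit`. */
struct counted_bp {
    unsigned long hits;    /* times the location has been reached */
    unsigned long limit;   /* hit number at which to stop */
};

/* Record one arrival at the breakpoint; return 1 if we should stop. */
static int bp_hit(struct counted_bp *bp)
{
    assert(bp != NULL);
    bp->hits++;
    return bp->hits == bp->limit;
}

/* Simulate a run in which call number `crash_call` is the one that
 * crashes.  Returns the hit count when the run ended and sets *stopped
 * to 1 if the breakpoint fired, or 0 if the run ended in the crash. */
static unsigned long run_until(struct counted_bp *bp,
                               unsigned long crash_call, int *stopped)
{
    unsigned long call;

    assert(bp != NULL && stopped != NULL);
    bp->hits = 0;
    *stopped = 0;
    for (call = 1; call <= crash_call; call++) {
        if (bp_hit(bp)) {       /* stop on entry, before the call runs */
            *stopped = 1;
            break;
        }
    }
    return bp->hits;
}
```

The first pass runs with limit set to COUNT_INFINITY, never stops, and leaves the hit count of the crashing call behind (2435 in the transcript). The second pass uses that count as the limit and stops on entry to the very same call.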

Dwarf and XML

Tuesday, December 19th, 2006

I’ve been having a hard time sifting through huge dwarf dump files in the last year or so, especially some of the huge dumps from the C++ standard template library. (Blech) So I’ve been working on a side project to let me do more powerful queries on dwarf information.  The part of the dwarf information that I usually have to sort through is the .debug_info section.  It’s essentially an abstract syntax tree of all (or part) of the information in the object file. In order to make it easier to sift through, I’ve started to write an XML dumper for this information, so that I get information something like:

     <t:namespace id='1178'>
      <name          string ='1'>std     </name>
      <SUN_link_name string ='1'>__1nDstd_</SUN_link_name>
      <sibling       ref4   ='1643'/> <!--__rwstd-->
      <t:structure_type id='1197'>
         <name               string ='1'>char_traits&lt;char&gt;</name>
         <SUN_part_link_name string ='1'>nLchar_traits4Cc_</SUN_part_link_name>
         <decl_file          data1  ='3'/>
         <decl_line          data1  ='182'/>
         <SUN_template       ref4   ='1247'/> <!--char_traits-->
         <declaration        flag   ='1'/>
         <t:template_type_parameter id='1241'>
            <type ref4 ='883'/> <!--char-->
         </t:template_type_parameter>
      </t:structure_type>

Instead of the usual dwarfdump form, which is:

<1>< 1178>      DW_TAG_namespace
                DW_AT_name                  std
                DW_AT_SUN_link_name         __1nDstd_
                DW_AT_sibling               <1643>
<2>< 1197>      DW_TAG_structure_type
                DW_AT_name                  char_traits<char>
                DW_AT_SUN_part_link_name    nLchar_traits4Cc_
                DW_AT_decl_file             3 /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/rw/traits
                DW_AT_decl_line             182
                DW_AT_SUN_template          <1247>
                DW_AT_declaration           yes(1)
<3>< 1241>      DW_TAG_template_type_parameter
                DW_AT_type                  <883>

The XML format is still preliminary, but it lets me play around with using the XQuery language for searching the XML and extracting pieces of it.  (I could also use XSLT, but XQuery is a little better for joins and more complex searches.) XQuery includes the XPath syntax as a subset.  I’m sure all this is just a bunch of gobbledygook unless you already know some of this stuff, so here is an example:

In XPath, you can select all the XML nodes in a document based on what their parents are, for example:

//namespace/struct

This XPath expression would select all the “struct” XML nodes that are children of “namespace” nodes.

Using XQuery I wrote a simple script to dig out all the elements with a specific name, and show the names of the containers that are their ancestors.  The pathname to Mukesh’s source tree makes a featured appearance here because that’s where I got my sample debug information from.  It all started while I was trying to track down a bug in the debug info for libCstd.

% ruby dwcmd.rb dwarf xgrep findname dw.xml __unLink
<?xml version="1.0" encoding="UTF-8"?>

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/string.cc - 11
   std - 1120
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 1827
   __unLink - 2455

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/string.cc - 11
   std - 1120
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 1771
   __unLink - 2201

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/ostream.cc - 11
   std - 1121
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 2735
   __unLink - 2926

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/ostream.cc - 11
   std - 1121
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 2806
   __unLink - 2997

As you can see, an item named “__unLink” shows up 4 times.  I extended the script to allow you to filter which items you wanted to see based on the names of their containers.  So when I search for “ostream:__unLink” the script will only show me items named __unLink that are within items that have “ostream” in the name.

% ruby dwcmd.rb dwarf xgrep findname dw.xml ostream:__unLink
<?xml version="1.0" encoding="UTF-8"?>

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/ostream.cc - 11
   std - 1121
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 2735
   __unLink - 2926

   /set/c++/cafe8/mkapoor/lang5.9/libCstd.2.1.1/include/ostream.cc - 11
   std - 1121
   basic_string<char,std::char_traits<char>,std::allocator<char> > - 2806
   __unLink - 2997

Pretty cool, huh?
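
For readers who don’t speak XQuery, the container filter boils down to a walk up the parent chain. Here is a rough C sketch of the idea. To be clear, this is not the script itself (that is Ruby driving XQuery); struct die, has_ancestor_like and count_matches are names I invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One debug-info entry; `parent` is NULL for a top-level entry. */
struct die {
    const char       *name;
    const struct die *parent;
};

/* Return 1 if any ancestor of `d` has `needle` somewhere in its name. */
static int has_ancestor_like(const struct die *d, const char *needle)
{
    const struct die *a;

    assert(d != NULL && needle != NULL);
    for (a = d->parent; a != NULL; a = a->parent)
        if (strstr(a->name, needle) != NULL)
            return 1;
    return 0;
}

/* Count the entries named exactly `name`.  When `container` is not
 * NULL, count only those with an ancestor whose name contains it. */
static size_t count_matches(const struct die *dies, size_t n,
                            const char *name, const char *container)
{
    size_t i, found = 0;

    assert(dies != NULL && name != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(dies[i].name, name) != 0)
            continue;
        if (container == NULL || has_ancestor_like(&dies[i], container))
            found++;
    }
    return found;
}
```

In the output above, “ostream” matches because it appears in the name of the outermost container (the ostream.cc path), which is why a substring test on every ancestor is enough.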

Anyway, that’s as far as I got. There are always more compiler bugs to fix, so I don’t get much time to work on infrastructure and internal tools. Maybe I’ll get some more hacking done over the holidays. XML feeds into some of my areas of technical curiosity, like RDF, RDFa, SPARQL, FOAF, etc.

Importing debug information into dbx

Monday, November 14th, 2005

I’m sure I wrote this up somewhere before, but now I can’t find it, so here it is again in case you guys (my two faithful readers) haven’t seen this trick yet. If you are stuck with a core file from a binary that doesn’t have debug information, you can “import” debugging information using the “loadobject -load” command. It’s especially useful for C++ to help get rid of the mangled names that show up in stack traces.

% #########################
% more t.c

#include "t.h"

struct foo foofoo;

int
main()
{
    foofoo.a = 1;
    foofoo.b = 2;
    * (int *) 0 = 0;
}

% #########################
% more t.h

struct foo {
int a;
int b;
};

% #########################
% cc -o t t.c # no debug info

% #########################
% ./t
Segmentation Fault (core dumped)

% #########################
% dbx t core
Reading t
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libc_psr.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x00010bb4: main+0x001c:   clr      [0]
(dbx) whatis foofoo
dbx: warning: unknown language, 'c' assumed
(int {assumed}) foofoo;
(dbx) print foofoo
foofoo = 0x1
(dbx) whatis -t foo
dbx: "foo" is not defined in the scope `t`main`
dbx: see `help scope' for details
(dbx) quit

% # 
% #  You really want to see the contents of the 'foofoo'
% #  structure, but the binary doesn't have debug info!
% #  So create a dummy .so file with debug info, and load
% #  that into dbx manually.
% # 

% #########################
% more dummy.c

#include "t.h"

% #########################
% cc -G -g -o dummy.so dummy.c

% #########################
% dbx t core
Reading t
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libc_psr.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x00010bb4: main+0x001c:   clr      [0]
(dbx) loadobject -load dummy.so
Reading dummy.so
Loaded loadobject: /set/dbx/somewhere/misc/coretest/dummy.so
(dbx) modules | grep dummy
Not Read  dummy.o
(dbx) module dummy.o
Read      dummy.o
(dbx) whatis -t foo
struct foo {
    int a;
    int b;
};
(dbx) print *(struct foo*)&foofoo
dbx: warning: unknown language, 'c' assumed
*((struct foo *) &foofoo) = {
    a = 0x1
    b = 0x2
}

The story of lazy stabs

Thursday, October 27th, 2005

There’s a dbx feature called “lazy stabs” that is clever, but a little confusing sometimes. I figured I’d talk about it a little to give an overview of what’s happening. There are really two parts to the idea of “lazy stabs”: one part is something we do all the time (demand loading most information when you first visit a source file), and the other part depends on how you compile your code (most debug info can be aggregated into the a.out or it can be left in the .o files).

You can read more about stabs and dwarf here:

stabs versus dwarf

Index debug info (stabs and dwarf)

Dbx will always demand-load line number information and other information about local symbols. So until you visit a source file for some reason, dbx won’t bother loading the majority of the debug information for that file. This makes dbx start up much faster. When you first load the binary, only the global symbols and other index information are loaded.

For example, if you want to stop at “foo.h:12” then dbx needs to load the detailed source information for all files that include code from foo.h. Then dbx figures out which object files have code from line 12, and sets the breakpoint(s) you need.

In stabs, the index information is stored in the .stab.index section. In dwarf, the index information is stored in multiple sections with names like: .debug_pubnames, .debug_varnames etc.
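
The demand-loading idea is ordinary lazy initialization. As a rough C sketch (my own toy model, with invented names, not how dbx is actually structured): each module starts out “not read”, and the expensive parse happens only on the first visit.

```c
#include <assert.h>
#include <stddef.h>

enum { NOT_READ = 0, READ = 1 };

/* Toy model of one object file's detailed debug information. */
struct module {
    const char *object_file;
    int         state;   /* NOT_READ until the first visit */
    int         loads;   /* how many times the expensive read ran */
};

/* Stand-in for parsing line numbers and local symbols. */
static void read_details(struct module *m)
{
    m->loads++;
    m->state = READ;
}

/* Called whenever the debugger needs detailed info for `m`, for
 * example to place a breakpoint on a source line.  Only the first
 * call pays for the read. */
static void visit(struct module *m)
{
    assert(m != NULL);
    if (m->state == NOT_READ)
        read_details(m);
}
```

At startup every module would be NOT_READ and only the index is consulted. Visiting the same module twice still leaves loads at 1, which is where the faster startup comes from.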

reading debug info from .o files (stabs only)

Stabs were carefully designed not to depend on relocation records (which need to be resolved by the linker).

For most functions and variables, stabs uses the linker name of the function or variable to represent that object. At runtime dbx will access the global symbol table in the a.out and look up the symbol by name. For C++, the linker name is the mangled name, so the character strings recorded as part of stabs can get huge. A few releases ago, we started using a compressed form of mangled names, which helped somewhat.

Because of that design, dbx can read most stabs from a .o file, and make sense of them. (Of course, the index stabs still come from the a.out.) This allows the a.out to have a smaller size on disk. (Note that having stabs in a program never affects the run-time size of a program, or its performance, because stabs are not loaded at runtime.)

This has bitten some people working on mozilla in the past: https://bugzilla.mozilla.org/show_bug.cgi?id=146154

Dwarf information encodes the absolute addresses of functions and variables, and so it needs to be relocated by the linker in order to make sense. That means we can’t support this aspect of “lazy stabs” (really it’s better called “dispersed stabs” or something like that). The a.out has to include all the dwarf information for the program. In exchange for this, dwarf takes up significantly less space in C++ programs that use long mangled names.
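
The difference between the two designs can be sketched in a few lines of C. This is a toy model under my own assumptions (the struct layouts and function names are invented, and real stabs and symbol tables look nothing like this): a by-name record carries no address at all, so it is meaningful straight out of a .o file, and the address is found later by looking the name up in the linked a.out’s symbol table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of the a.out's global symbol table (known after linking). */
struct symbol {
    const char   *linker_name;
    unsigned long addr;
};

/* Stabs-style record: identifies the object by linker name only, so
 * nothing in it needs to be relocated by the linker. */
struct by_name_record {
    const char *linker_name;
    const char *type;
};

/* Look `name` up in the table; return 1 and set *addr if found. */
static int lookup(const struct symbol *symtab, size_t n,
                  const char *name, unsigned long *addr)
{
    size_t i;

    assert(symtab != NULL && name != NULL && addr != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(symtab[i].linker_name, name) == 0) {
            *addr = symtab[i].addr;
            return 1;
        }
    }
    return 0;
}

/* Resolve a by-name record at debug time against the linked symbols. */
static int resolve(const struct by_name_record *rec,
                   const struct symbol *symtab, size_t n,
                   unsigned long *addr)
{
    assert(rec != NULL);
    return lookup(symtab, n, rec->linker_name, addr);
}
```

A record that stored the address directly would skip the lookup, but the stored value would be wrong until the linker had patched it, which is the relocation dependency described above for dwarf.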

Larger a.outs (dwarf, or stabs with -xs)

When using stabs, you can compile with the -xs flag, which tells the compiler to collect all the stabs into the a.out. Dbx will still demand-load them, but it works better if you want to archive the binary with debug information, or if you want to clean up your build area but keep a debuggable binary. When dbx is loading stabs from the .o files, if you move the directory that has the .o files in it (or move the .o files themselves), then you have to use the pathmap command in dbx to tell dbx where they went.

(Aside: You might think -xs would logically be used at link time, but you need to use -xs at compile time so the compiler can tag the stabs sections with a flag that means “accumulate into a.out”. This flag causes the linker to aggregate the stabs at link time.)

The increase in a.out size with dwarf will probably be a surprise to people who are used to smaller a.outs when using stabs, but I’m personally looking forward to it. I’ve had to deal with many users over the years who wanted to send me a binary with debug information in order to reproduce a bug, but the a.out is normally missing the majority of the debug information with stabs. I had to tell them to rebuild their program with the -xs option, or else tar up the entire build tree and send it to me. With dwarf, that problem won’t come up again. Everything will be in the a.out.

One important thing to remember is that the stabs and dwarf information isn’t ever loaded into your program when it’s run. So it won’t affect the runtime performance or take up any memory when your program runs. The information only takes up disk space. And it’s disk space that is also taken up by the .o files. So if you previously were saving your object files so that you could debug your program, you can stop doing that if you are using a compiler that emits dwarf. It also makes it easier to keep the non-stripped version of a binary that you strip and ship as part of a product.

MAP_NORESERVE and dbx

Tuesday, October 4th, 2005

So your program has Obj * p; in it, and you stop in dbx and say print *p. Dbx comes back and says the memory is illegal or unmapped. But then you continue your program, and your program reads from *p just fine. What’s going on?

This gotcha showed up again on a Sun alias. It’s been known to happen with a large database application whose initials are O. It’s not really a bug in dbx or a bug in the program, it’s just a slightly confusing “feature” in Solaris.

Solaris allows you to call mmap to allocate memory in the address space, and it supports an option called MAP_NORESERVE. When you supply this option, the kernel will create a range of virtual memory that can be accessed by the program, but this memory will be filled with zeros on demand. Swap space is not reserved for the memory until it’s used, which makes it great for very large arrays that are only sparsely filled in.

Unfortunately, this creates memory which is perfectly valid from the program’s point of view, but which Solaris (and the /proc interface) doesn’t admit is really there, when dbx asks about it.
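
If you want to reproduce the situation, here is a minimal C sketch of such a mapping. The function name map_sparse is made up for this example, and the use of MAP_ANON (rather than mapping /dev/zero) is my assumption about what is available on your system.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose MAP_ANON and MAP_NORESERVE */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Map `len` bytes of zero-filled, private memory without reserving
 * swap space for it.  Returns NULL if the mapping fails. */
static void *map_sparse(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The program can read any byte of the returned region and will see zero, and the first write to a page is what actually makes it real. Going by the description above, stopping in dbx right after the call and printing through the pointer, before the program has touched that page, is where you would expect the “unmapped” complaint.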

In theory this bug could be “fixed” by modifying the /proc interface in Solaris so that it allows reading from all NORESERVE addresses and returns a value of 0. I suppose the kernel should prevent /proc from writing to any of these locations unless the user made some explicit statement that they wanted to modify the memory map of the target process.

It’s usually not a big deal to work around, you just have to be aware that dbx will tell you something is unmapped, even though the program might correctly be able to read/write to that location.