Saturday, October 10, 2009

Quantifying the NetWatch performance hit

I was going through the Makefiles in preparation for adding 440BX support (for QEMU and Crashbox), and decided to skim the README, since I think I made Jacob write it, and I'd never read it in much detail! In the README, he said:

Because NetWatch is invisible to the OS, its CPU usage is difficult to monitor; we do so by comparing the MD5 throughput of the system with NetWatch running versus without. The only way that the OS could detect this performance drain is by spinning tightly and watching for a sudden jump in the CPU's time stamp counters.

I had considered this to be a problem before (indeed, when the system is doing a lot of VNC work, not only does processing power noticeably diminish, but the machine "feels" laggy!), but I never really had a way to quantify it. But rereading Jacob's notes -- in particular, the part about the OS reading the TSCs -- I realized that the OS isn't the only one that could use the TSCs to quantify NetWatch's performance hit. If we were clever, we could read the TSC each time we enter and leave SMM. Since we come in every 64msec or so, there's no realistic chance of the TSC overflowing between readings and giving an inaccurate result (it's 64 bits; to roll over in that timeframe would require a clock of roughly 288,230 petahertz!).

The procedure, then, would be to count the ticks from when we last left SMM to when we next enter it, and call that the time spent by the OS. The ticks from entering SMM to leaving it are the time spent by NetWatch. This should be highly accurate (although the number I really care about is a percentage to two significant figures).

I should note, by the way, that this isn't a generalized solution for similar performance problems on "real systems". It only works in NetWatch because we're guaranteed that the CPU clock won't change, for two reasons: for one, the system is too old to have SpeedStep, and for another, NetWatch clobbers the existing ACPI implementation, which keeps Linux ACPI from coming up and changing the CPU clock even if that were possible on this system.

On newer systems, the right answer would probably be to use the HPET, but on ICH2, the HPET simply isn't present! Also, I believe the HPET requires setup and board-specific probing; real OSes find it through ACPI.

Hopefully I'll get a chance to play with this tomorrow and come up with some numbers.

Saturday, October 3, 2009

Reanimation

Over the summer, Jacob did a lot of important work to make NetWatch run on AMD64 and to add a GDB slave to NetWatch. Back at CMU, I've spent the last few hours reanimating NetWatch on ICH2. I have now pushed changes to the Git repository; it should build again! Of note, NetWatch can now display all of the registers that SMM provides, and the backtrace is quite a bit more robust. I have not integrated the GDB support into the ICH2 build yet.

Also on an exciting note, NetWatch recently had its first application as a real debugger. Sully, in the course of his 15-412 project (a network driver for Pebbles), found a case in which the driver worked in QEMU but instantly grenaded on hardware. Luckily, his 412 test system is the same as the NetWatch test system, so we booted his kernel in NetWatch and got a backtrace when the fireworks came. Sure enough, the backtrace pointed exactly to where the system blew up! Sully shook my hand for producing a useful tool, and punched me for doing it with System Management Mode.

On a final note, I am considering writing a paper on NetWatch. Stay tuned.

Sunday, December 14, 2008

Saturday, December 13, 2008

Generally, the wisdom is...

Generally, the wisdom when you say, "I need a foo" for some common value of foo is to not write your own, but instead use a preexisting one.

But we would like to offer some new wisdom. The first step, we believe, when you want an IP stack is to burn lwIP to the fucking ground.

Friday, December 12, 2008

Keyboard

Jacob and I had a mini-sprint this afternoon and figured out some things about keyboards, magic numbers, i8042s, UNGET commands (which make life easier), keyboard performance, linking VNC into the keyboard routines, and grub crashes. We made major progress in all of those areas except magic numbers, where the magic persists.

Tomorrow: differential updates for better user performance while running X, and text console framebuffer emulation. ... and P3 style files.

Tonight: sleep.

Code, as always, is in git.

Sunday, December 7, 2008

Big changes

Big news recently. A week or so ago, Jacob got the VNC/RFB server talking to a client. Performance wasn't great, but we had output. Now, we've done two things that make a huge difference in terms of performance:

  • The network driver no longer waits for each packet to finish being sent or received; instead, it queues packets as needed, like modern network drivers do. We now use the full capabilities of the card's bus-master DMA interface, and boy howdy does it pay off: ping performance is in the 'as expected' region, where essentially every ping is answered in under 64msec.

  • The big win, though, came from something very silly -- turning on the cache! You're supposed to keep SMRAM uncached while you're outside of SMM, so that nobody outside SMM can get in your way by filling your cache with bogons (and hence make you crash in SMM). But if you want performance, you need to turn caching on once you're in SMM. Finally, we get the blazing fast speed that our 1GHz P3 promises. This took us from about 3 minutes per frame to 15 seconds per frame, and the network driver improvements are chipping away at that further.


This is very exciting. VNCviewer now reports enough throughput to request hextile encoding instead of ZRLE -- it thinks we're getting 4.5MByte/sec from the machine. Of course, we haven't implemented hextile (or ZRLE), but... It may be time to declare the performance fun over, at least until we get mouse and keyboard support implemented.

Tuesday, November 25, 2008

Framebuffer access

OK, we now have unaccelerated access to the framebuffer. Maybe DMA soon if I have time. j4cbo is making good progress on RFB. In the interim, here's a picture for you, showing the venerable test phrase given to bootstrap all graphic demos that I've produced to date:



As usual, code is in Git.