Thursday, June 28, 2007

Cost-based vacuum delay caveat in Postgres

I've been trying to vacuum a 25M row table in Postgres and it has been taking forever; we're talking over 22 hours (I thought I'd just let it run as I flew to Philadelphia for this conference). A bit of Googling turned up this thread:

VACUUM ANALYZE taking a long time, %I/O and %CPU very low

This guy was seeing the same behaviour as I was: VACUUM ANALYZE was taking forever, and CPU and I/O percentages were hovering around 0. He had the "vacuum_cost_delay" parameter set to 70, which means that Postgres will go to sleep for 70ms whenever the accumulated I/O cost of the vacuum exceeds a certain limit ("vacuum_cost_limit"). Since a 25M row table isn't going to fit in memory, there's going to be a good deal of reading blocks in from disk, and thus you're going to hit that cost limit (and trigger the sleep) over and over.
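
If you suspect this is what's slowing down your vacuums, it's easy to check your current settings first. Here's one quick way, using the standard pg_settings view:

-- list all of the cost-based vacuum settings at once
SELECT name, setting FROM pg_settings WHERE name LIKE 'vacuum_cost%';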

Somehow I had set my delay to 500ms. No wonder it was taking so long. I dropped it down to 0, effectively disabling the cost-based delay feature. Now, 10 minutes later, my table has been vacuumed and analyzed.

Now, you can use the autovacuum daemon to vacuum your tables, and the pg_autovacuum table (where you specify table-specific vacuum parameters) lets you set a per-table value for vacuum_cost_delay. Thus, you can set the attribute "vac_cost_delay" to 0 to get quick autovacuums of your big tables, while still keeping a system-wide vacuum_cost_delay for your other smaller, less critical tables. It looks like a manually kicked-off vacuum still uses the system-wide defaults, though, instead of the values from pg_autovacuum (why?). Since you can change vacuum_cost_delay without reloading the server, if you need to do a manual vacuum, do a
SET vacuum_cost_delay = 0;
first (or something higher than 0 if you can't afford to peg your disk I/O), and then VACUUM (remembering to set vacuum_cost_delay back to what it was afterwards!). If you do this from the command line, you might want to write a small wrapper script to do this instead of running vacuumdb.
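
Here's a minimal sketch of what such a wrapper might look like (the script name and arguments are my own invention). Note that SET only affects the current session, so the system-wide setting snaps back on its own as soon as psql exits:

#!/bin/sh
# fastvacuum.sh -- vacuum and analyze one table with cost-based delay disabled.
# Usage: fastvacuum.sh dbname tablename
DB=$1
TABLE=$2

# Feeding the statements via stdin makes psql send them one at a time,
# so the VACUUM runs outside of a transaction block.
psql -d "$DB" <<EOF
SET vacuum_cost_delay = 0;
VACUUM ANALYZE $TABLE;
EOF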

The lesson here? Always read the directions, kids.

Monday, June 25, 2007

Transformers: Members of the Coalition of the Willing?

I'm going to a conference tomorrow, and decided to check on the TSA's website to make sure I wasn't going to be breaking any of their wonderfully inane rules, like bringing 4 oz. of shampoo (horrors!) in my carry-on luggage.

I was quite surprised to find that they specifically allow "Toy Transformer Robots" (scroll down near the bottom). Even without that, Megatron would still be OK, because toy guns (so long as they don't look like real guns) are cool.

Furthermore, meat cleavers are prohibited by name in carry-on luggage (come on, you ban sabers and swords, ninja stars, and ice picks, and with all that you still have to call out meat cleavers?!?!).

I'm glad our government is hard at work protecting us from Shampoo Bombers and insane butchers, but alas, they are falling behind in preventing the impending robot invasion!

Thursday, June 21, 2007

Unreal

If anyone ever doubts that the Internet can truly be a powerful democratizing force in the world, where the average person can say something and have it matter, check this out.

I started this blog last week. I've never blogged anywhere before, and a Google search for my name isn't going to turn up any significant hits about me (except, now, for the post I'm about to talk about!). In other words, I'm not a "big voice" on the Internet.

A few days ago, I posted my third blog post ever to this free Blogger account. I wrote about how I liked David Weinberger's book Everything is Miscellaneous, and made an observation about how the themes he develops tie into what I work with, namely the human genome. Nothing big, maybe a little insightful (I thought it was neat, anyway). I wasn't really writing "for" anyone... this blog is just a place I can write some of my own thoughts down, and if that might be useful or interesting to someone somewhere, then all the better.

Today I'm sifting through my newsfeeds, and I see that David Weinberger has linked to my post on the main page of his book's website.

Think about that for just a second.

Thanks to the infrastructure that has grown up around the Internet (Google indexing, Technorati blog indexing, folksonomic tagging, etc.), the words that I wrote were found and read by the author of the very book I was talking about. This isn't a top-down organization, either: there aren't professional indexers, catalogers, and abstractors out there reading and organizing everything that gets published online. This is truly bottom-up organization, growing organically out of the miscellaneous pile of information we're building online: the content, the usage patterns, the metadata, everything. Nobody needs to say, "Ah, Christopher Maier has published a post on Everything Is Miscellaneous. We need to properly file his post in the 'Everything Is Miscellaneous' bin (or was it the 'genomics' bin, or...)." Furthermore, very few people, in the grand scheme of things, are going to particularly care that I've done such a thing. But for the people who would care, the ones looking for something about Everything Is Miscellaneous, or genomes, or whatever else I talk about, this infrastructure presents it to them, as if by magic.

It is difficult, if not downright impossible, to imagine this kind of thing happening prior to the advent of the Internet. And it's really exciting to see where this will ultimately lead.

Monday, June 18, 2007

Rodenbach

I recently discovered the joy of Flemish sour ale. That's some damn fine beer.

Sunday, June 17, 2007

The Genome Is Miscellaneous

Hopefully by now you have read David Weinberger's Everything Is Miscellaneous: The Power of the New Digital Disorder. It's quite an interesting and absorbing read, one of those books that makes you look at the world just a bit differently. I seem to be doing that an awful lot lately, finding unexpected applications of Weinberger's thesis all over the place. The latest? The human genome!

The ENCODE Project just published its findings from a detailed investigation of 1% of the human genome, and it looks like it's waaaaaaaaaay more complex and interesting than we thought. There's the main article (DOI: 10.1038/nature05874) in the current issue of the journal Nature, and a whole slew of additional articles in this month's Genome Research. I've been working through Gerstein et al.'s What is a gene, post-ENCODE? History and updated definition (DOI: 10.1101/gr.6339607), which gives a very absorbing look at how our notion of a "gene" has changed dramatically in the years since Mendel and his peas, and where our understanding of "gene" stands in light of this exciting new data from ENCODE.

It looks like the genome, far from being a nicely organized library of genetic building blocks, is a messy snarl of bits of coding DNA, all mixed up together in a pile. There is of course some physical structure to it all, but it seems pretty well jumbled up; the parts of a gene don't even need to be on the same chromosome. It reminded me of Weinberger's big miscellaneous pile, into which all our information goes, waiting to be organized by users and searchers according to their needs and desires. In the Miscellaneous Genome, the users and searchers are the complex regulatory networks of the cell, which seek out and assemble the bits they need to create the machinery and processes of life. They know how to read the genomic metadata that we are trying to grasp; once we can read the metadata, we'll be able to sift through the Miscellaneous Genome with ease.

Go read the book; go read the articles. Good stuff.

Tuesday, June 12, 2007

Postgres 8.2.4 Segmentation Fault on Mac OS X

I've been having an annoying segmentation fault with my recent install of PostgreSQL 8.2.4 on Mac OS X. It happens whenever I quit psql after switching to a different database:

psql(336) malloc: *** error for object 0x1811000: incorrect checksum for freed object - object was probably modified after being freed, break at szone_error to debug
psql(336) malloc: *** set a breakpoint in szone_error to debug
Segmentation fault

Looks like others have run into this as well:
http://www.entropy.ch/phpbb2/viewtopic.php?p=10266
http://archives.postgresql.org/pgsql-hackers/2006-11/msg00331.php

Apparently it has something to do with the readline libraries... I'm not sure exactly what, though. It's not a deal-breaker or anything, just annoying.
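
If you're curious which line-editing library your psql is actually linked against (libedit vs. GNU readline), otool will tell you; the path here is just where my install happens to live, so adjust for yours:

otool -L /usr/local/pgsql/bin/psql | grep -i -e readline -e edit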

Sunday, June 10, 2007

Concatenating PDFs

A while back I downloaded the Basic Cryptanalysis Army Field Manual from the University of Michigan. The manual is available as one PDF per chapter, but I'd like to have the entire manual as a single, complete PDF.

It turns out that Mac OS X already has this capability built in. With a pointer from this site, I put together this command to create my single PDF:

/System/Library/Automator/Combine\ PDF\ Pages.action/Contents/Resources/join.py \
    -o military_cryptanalysis.pdf \
    toc.pdf pref.pdf intro.pdf \
    ch1.pdf ch2.pdf ch3.pdf ch4.pdf ch5.pdf ch6.pdf ch7.pdf ch8.pdf \
    ch9.pdf ch10.pdf ch11.pdf ch12.pdf ch13.pdf ch14.pdf ch15.pdf \
    appa.pdf appb.pdf appc.pdf appd.pdf appe.pdf appf.pdf \
    gloss.pdf ref.pdf index.pdf

The resulting PDF is rather large (~31MB), so there's probably some compression to be done. But the point is that you can concatenate PDFs easily, right out of the box, with OS X.
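
If the size bothers you and you happen to have Ghostscript installed (it doesn't come with OS X, so this part is strictly optional), something like this should shrink the combined file by downsampling its images; /ebook is just one of several quality presets:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dBATCH -sOutputFile=military_cryptanalysis_small.pdf \
   military_cryptanalysis.pdf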