Geek in a Suit

Sunday, April 12, 2009

Re-thinking Object-Relational Mapping in a Distributed Key-Store world

JDO, JPA, Hibernate, Toplink, Cayenne, Enterprise Object Framework, etc., are workable object to datastore mapping mechanisms. However, you still have to optimize your approach to the underlying data system, which in most cases historically have been RDBMS. Hence these systems, their habits and idioms are all quite strongly tied to Object-Relational Mapping. Since Google's App-Engine released a Java app hosting service, with a datastore wrapped in JDO or JPA via DataNucleus this week, people have been playing, and the difficulties of these very habits have become clearer. It's easy, when using JDO or JPA with BigTable, to design as if we're mapping to an RDBMS, which a distributed column-oriented DBMS like BigTable is not. There are some good articles on the net about how to think about this sort of data store:

http://www.mibgames.co.uk/2008/04/15/google-appengine-bigtable-and-why-rdbms-mentality-is-harmful/

http://torrez.us/archives/2005/10/24/407/

http://highscalability.com/how-i-learned-stop-worrying-and-love-using-lot-disk-space-scale

I'm struggling with this myself, being a long-time (longer than most) user of Object-Relational mapping frameworks. For example, one cannot do things with BigTable like join across relationships to pre-fetch child object data - a common optimization. Keystores are bad at joins, because they're sparse and inconsistently shaped. The contents of each "row" may not have the same "columns" as each other, so building indexes to join against is difficult. We actually need to re-think normalization, because it's not the same kind of store at all.

Interestingly, OO and RDBMS actually DIDN'T have an impedance mis-match in one key area in that Entity-Relationship models bore a striking structural resemblance to Object-Composition or Class diagrams, and the two could be mapped. Both RDBMS schemata and Object Models were supposed to be somewhat normalized, except where they were explicitly de-normalized for performance reasons. With distributed key-stores, we're potentially better off storing duplicate objects, or possibly building massive numbers of crazy indices. At this point, I don't know what the right answer is, habit wise, and I work for the darned company that made this thing. It's a different paradigm, and optimizing for performance on a sparce key-store is a different skill and knack. And since we're mapping objects to it, we'll have to get that knack then work out what mapping habits might be appropriate to that different store.

Eventually we will work out the paradigms, with BigTable and its open-source cousins (HBase and HyperTable) creating more parallelism and horizontal scaling on a "cloud-y web" as I've seen it called. The forums - especially the google-appserver-java and other forums where objects and keystores have to meet - they will grow the skills and habits within the community. But we have to re-think it, lest we just turn around and say "that's crap" and throw the baby (performance/scaling) out with the bathwater (no joins/set-theory).   

Labels: , , , ,

Monday, December 8, 2008

Agile Engineering Practices course in Ottawa, December 18-19

To my stalwart readers: I'm putting on my first public version of an Agile Engineering Practices training that I've presented with various clients over the last couple of years.

The training will cover a few areas, including:

  • Workflow of an Agile Software Development Cycle
  • Impact of Agile on Conception, Design, Construction, and Verification
  • Conceptual consequences of Agile methods
    • Software as an emergent property
    • Modularity, Dependency, and Component-Orientation
    • Incremental design
  • Infrastructure needs to support Agile methods
    • Build systems
    • Version Control and SCM
    • Continuous Integration
  • Development Practices and Approaches of Agile methods
    • Testing, Test-First, and Test Driven Development
    • Writing testable code
    • Collaborative software development
    • Pair Programming

There is a small overlap with the Scrum training, mostly for context, but a good chunk will focus on the notion of testable, flexible code, and how infrastructure and an agile approach changes your design concepts and techniques. Also, what architectural features will help your code evolve in a healthy way, rather than spin out of control, or paint you into a corner.

Anyway, if you're in Ottawa, or know anyone there who could benefit from such training, let them know. It'll be a small course (no more than 11 students). The cost is initially discounted from its ultimate price of $1,200.00. A steal at $750.00 (CAD of course). You can look at the course here or can register here.

I'm planning to arrange for more of these in other cities in the new year, including Toronto, Somewhere in the 519 area code - probably Kitchener/Waterloo, as well, possibly, as out west in Saskatoon or Calgary or Vancouver. If you are interested in these cities, let me know, and I'll keep you in the loop.

Labels: , , , , , , ,

Tuesday, November 4, 2008

code coverage != test driven development

I was vaguely annoyed to see this blog article featured in JavaLobby's recent mailout. Not because Kevin Pang doesn't make some good points about the limits of code coverage, but because his title is needlessly controversial. And, because JavaLobby is engaging in some agile-baiting by publishing it without some editorial restraint.

In asking the question, "Is code coverage all that useful," he asserts at the beginning of his article that Test Driven Development (TDD) proponents "often tend to push code coverage as a useful metric for gauging how well tested an application is." This statement is true, but the remainder of the blog post takes apart code coverage as a valid "one true metric," a claim that TDD proponents don't make, except in Kevin's interpretation.

He further asserts that "100% code coverage has long been the ultimate goal of testing fanatics." This isn't true. High code coverage is a desired attribute of a well tested system, but the goal is to have a fully and sufficiently tested system. Code coverage is indicative, but not proof, of a well-tested system. How do I mean that? Any system whose authors have taken the time to sufficiently test it such that it gets > 95% code coverage is likely (in my experience) thinking through how to test their system in order to fully express its happy paths, edge cases, etc. However, the code coverage here is a symptom, not a cause, of a well-tested system. And the metric can be gamed. Actually, when imposed as a management quality criterion, it usually is gamed. Good metrics should confirm a result obtained by other means, or provide leading indicators. Few numeric measurements are subtle enough to really drive system development.

Having said that, I have used code-coverage in this way, but in context, as I'll mention later in this post.

Kevin provides example code similar to the following:

String foo(boolean condition) {
   if (condition)
       return "true";
   else
       return "false";
}

... and talks about how if the unit tests are only testing the true path, then this is only working on 50% coverage. Good so far. But then he goes on to express that "code coverage only tells us what was executed by our unit tests, not what executed correctly." He is carefully telling us that a unit test executing a line doesn't guarantee that the line is working as intended. Um... that's obvious. And if the tests didn't pass correctly, then the line should not be considered covered. It seems there are some unclear assumptions on how testing needs to work, so let me get some assertions out of the way...

  1. Code coverage is only meaningful in the context of well-written tests. It doesn't save you from crappy tests.
  2. Code coverage should only be measured on a line/branch if the covering tests are passing.
  3. Code coverage suggests insufficiency, but doesn't guarantee sufficiency.
  4. Test-driven code will likely have the symptom of nearly perfect coverage.
  5. Test-driven code will be sufficiently tested, because the author wrote all the tests that form, in full, the requirements/spec of that code.
  6. Perfectly covered code will not necessarily be sufficiently tested.

What I'm driving at is that Kevin is arguing against something entirely different than that which TDD proponents argue. He's arguing against a common misunderstanding of how TDD works. On point 1 he and I are in agreement. Many of his commentators mention #3 (and he states it in various ways himself). His description of what code coverage doesn't give you is absurd when you take #2 into account (we assume that a line of covered code is only covered if the covering test is passing). But most importantly - "TDD proponents" would, in my experience, find this whole line of explanation rather irrelevant, as it is an argument against code-coverage as a single metric for code quality, and they would attempt to achieve code quality through thoroughness of testing by driving the development through tests. TDD is a design methodology, not a testing methodology. You just get tests as side-effect artifacts of the approach. Useful in their own right? Sure, but it's only sort of the point. It isn't just writing the tests-first.

In other words - TDD implies high or perfect coverage. But the inverse is not necessarily true.

How do you achieve thoroughness by driving your development with tests? You imagine the functionality you need next (your next increment of useful change), and you write or modify your tests to "require" the new piece of functionality. They you write it, then you go green. Code coverage doesn't enter into it, because you should have near perfect coverage at all times by implication, because every new piece of functionality you develop is preceded by tests which test its main paths and error states, upper and lower bounds, etc. Code coverage in this model is a great way to notice that you screwed up and missed something, but nothing else.

So, is code-coverage useful? Heck yeah! I've used coverage to discover lots of waste in my system. I've removed whole sets of APIs that were "just in case I need them" APIs, because they become rote (lots of accessors/mutators that are not called in normal operations). Is code coverage the only way I would find them? No. If I'm dealing with a system that wasn't driven with tests, or was poorly tested in general, I may use coverage as a quick health meter, but probably not. Going from zero to 90% on legacy code is likely to be less valuable than just re-writing whole subsystems using TDD... and often more expensive.

Regardless, while Kevin is formally asking "is code coverage useful?" he's really asking (rhetorically) is it reasonable to worship code coverage as the primary metric. But if no one's asserting the positive, why is he questioning it? He may be dealing with a lot of people with misunderstandings of how TDD works. He could be dealing with metrics bigots. He could be dealing with management-imposed-metrics initiatives which often fail. It might be a pet peeve or he's annoyed with TDD and this is a great way to do some agile-baiting of his own. I don't know him, so I can't say. His comments seem reasonable, so I assume no ill intent. But the answer to his rhetorical question is "yes, but in context." Not surprising, since most rhetorically asked questions are answerable in this fashion. Hopefully it's a bit clearer where it's useful (and where/how) it's not.

Labels: , , , , ,

Tuesday, October 21, 2008

Hi, my name is Christian, and I'm a geek in a suit.

I've been consulting in the high-tech field for over a decade, and programming for substantially longer than that. In fact, my earliest programs were BASIC programs on my C64, though my first true love was an Amiga 1000 owned by my ultra-cool uncle. But these days I wear a suit. I wear a suit because I consult with "geeks" and "suits" and all comers in between. Most of my clients have been in the Financial sector, so dressing up provides credibility with the customer, and my thinking has moved drastically towards a customer-focus over my decade of consulting.

I've started this blog, because I consider things from a few perspectives. As a technologist, I think about software feasibility, artistry, craft, and shippable result. As a geek I think about cool, new, innovative tech. As someone who has to act on behalf of the customer in many situations, I think about value and fitness-to-purpose and total cost of ownership. As a project manager I have to think about predictability, output, project throughput, cost, and so on. As a consultant, I think of cross-cultural conflict resolution, needs analysis, and communications (geeks and suits speak vastly different languages, leaving aside ethno-linguistic or gender communication issues). So my observations hit a lot of these areas. They often will stray into the Agile and Lean categories, though often I end up working in non-agile, non-lean environments, so they're not exclusive to these communities of interest.

I already blog on AgileAdvice.com, but really this is intended as a vehicle for personal thinking that's probably going to stray a bit from that blog's editorial perspective, though there will be some overlap. I may cross-post some. Regardless, I wanted a way to start talking about things important to me, professionally, from a variety of perspectives, whether or not I have an audience. Hopefully, someone will find this useful.

Cheers,

Christian Gruber

Labels: , , , , , , , , ,