Why Are There So Many Hurdles to Efficient SAM Benchmarking?

Two opposite sides
When dealing with Software Analysis and Measurement (SAM) benchmarking, people's behavior generally falls into one of two categories:

  • “Let’s compare anything and draw conclusions without giving any thought about relevance and applicability”
  • “There is always something that differs and nothing can ever be compared”

As is often the case, few people settle on the sensible middle ground.

Benchmarking challenges
Some of the most common reasons for objecting to comparing SAM results are:

  • Applications use different technology stacks, with varying thresholds on how much difference matters, such as:
    • Object-oriented vs. procedural, implying that comparing object-oriented applications with one another is OK
    • JEE vs. .NET, implying that comparing object-oriented applications is not OK unless they are of the same flavor
    • With or without the Hibernate framework (or any other framework, for that matter), implying that comparing applications using different frameworks is not OK
  • Measurement capability evolves over time, with:
    • New measure elements to check for new risk-inducing patterns
    • Improvements to existing measure elements to reduce false positives
  • The measurement process relies on contextual information, such as:
    • Target architecture
    • Vetted libraries
    • In-house naming norms

All of the above reasons make perfect sense, and one has to be well aware of these situations. However, they are no grounds for dismissing the possibility of comparison altogether.

Built-in clutch
A well-designed measurement model can help overcome these challenges. Indeed, with a measurement model that aligns with the Goal-Question-Metric (GQM) approach, the three levels act like a built-in clutch.

How so? Because the Metrics required to answer the Questions one asks to measure the level of achievement of each Goal can differ from one technology stack to another, or evolve from one release of the measurement platform to another, without invalidating the results. The same holds for the number and nature of the Questions asked to measure the level of achievement of each Goal.

Then, with a measurement model that uses a compliance ratio to consolidate and aggregate results (as opposed to a raw count of non-conformities), the statistical processing itself acts as a built-in clutch.
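
To make this concrete, here is a minimal sketch of such a three-level model in Java. The class names, the equal-weight averaging, and the 0-to-1 compliance scale used here are illustrative assumptions, not the definition of any particular measurement platform; the structural point is simply that every level is consolidated as a compliance ratio rather than a raw count.

    import java.util.List;

    // A minimal, illustrative sketch of a three-level GQM model. The names and
    // the equal-weight averaging are assumptions for the sake of the example,
    // not the definition of any particular measurement platform.
    public class GqmSketch {

        /** One Metric: how many of the checked objects comply with the rule it encodes. */
        public record Metric(String name, long compliantObjects, long checkedObjects) {
            public double complianceRatio() {
                return checkedObjects == 0 ? 1.0 : (double) compliantObjects / checkedObjects;
            }
        }

        /** A Question is answered by whichever Metrics the technology stack calls for. */
        public record Question(String text, List<Metric> metrics) {
            public double score() {
                return metrics.stream().mapToDouble(Metric::complianceRatio).average().orElse(1.0);
            }
        }

        /** A Goal is measured through its Questions; the Metric count underneath may vary. */
        public record Goal(String name, List<Question> questions) {
            public double score() {
                return questions.stream().mapToDouble(Question::score).average().orElse(1.0);
            }
        }
    }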

From one technology stack to another
At the Question level, the number of contributing Metrics may differ: capabilities differ from one technology to another, and so do the sources of issues, hence the different number of Metrics. However, as long as the Question is well answered by that smaller or larger set of Metrics, the difference is normal and acceptable.

At the Goal level, the same holds even if the number of contributing Questions differs: the difference is inherent to the technology and is therefore normal and acceptable.

For example, there is no way to over-complexify COBOL code with coding and architectural practices related to object-oriented capabilities, while there are many ways to do so with JEE. The comparison might seem unfair to JEE, which is assessed against more rules and technical criteria than COBOL, but this is inherent to their respective capabilities, and a fair assessment of transferability or changeability risk must take this difference into account.

As the assessment result will guide resource-allocation decisions, it is critical to know that a given JEE component carries not only regular algorithmic or SQL or XYZ complexity, but also an additional load of complexity due to excessive polymorphism or XYZ.
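
Building on the hypothetical GqmSketch above, and with invented figures, a short usage sketch shows how a JEE codebase can be assessed with more Metrics than a COBOL one for the very same Question, while both still land on the same compliance scale and remain comparable:

    import java.util.List;

    public class StackComparison {
        public static void main(String[] args) {
            // Hypothetical figures: for the same Question, a JEE codebase is
            // assessed with more Metrics than a COBOL one, because more
            // risk-inducing patterns exist in that technology.
            var jeeComplexity = new GqmSketch.Question("Is complexity under control?", List.of(
                    new GqmSketch.Metric("Cyclomatic complexity", 900, 1_000),
                    new GqmSketch.Metric("Excessive polymorphism", 180, 200),
                    new GqmSketch.Metric("Deep inheritance trees", 95, 100)));

            var cobolComplexity = new GqmSketch.Question("Is complexity under control?", List.of(
                    new GqmSketch.Metric("Cyclomatic complexity", 2_700, 3_000)));

            // Both scores live on the same 0-to-1 compliance scale, so they can be
            // compared even though the number of contributing Metrics differs.
            System.out.printf("JEE:   %.2f%n", jeeComplexity.score());   // ~0.92
            System.out.printf("COBOL: %.2f%n", cobolComplexity.score()); // 0.90
        }
    }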

From one release to another
New releases of the measurement system are designed to deliver more accurate assessment results. However, added accuracy will impact results. The use of a compliance ratio can help limit that impact, in the cases where it should be limited, as the worked example after the list below illustrates:

  • If the newly supported syntax or extra capability leads to the testing of new objects whose quality level is similar to that of objects with previously supported syntax, the ratio will remain stable.
  • If the newly supported syntax or extra capability leads to the testing of new objects whose quality level is markedly worse than that of objects with previously supported syntax, the ratio will drop, and rightly so: it makes previously invisible quality issues visible.
  • The same reasoning applies to a new quality check: no impact if the quality level is similar, an impact if there is a genuinely new piece of information to surface.
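
A back-of-the-envelope sketch, with invented numbers, illustrates the first two cases:

    public class NewReleaseImpact {
        public static void main(String[] args) {
            // Hypothetical figures. Before the new release, 9,000 of the 10,000
            // analysed objects comply: the compliance ratio is 0.90.
            double before = 9_000.0 / 10_000.0;

            // The new release covers 1,000 additional objects.
            // Case 1: their quality matches the existing code, so the ratio holds.
            double similarQuality = (9_000.0 + 900.0) / 11_000.0;   // 0.90

            // Case 2: the newly covered objects are markedly worse, so the ratio
            // drops, not because quality degraded but because hidden issues
            // have become visible.
            double worseQuality = (9_000.0 + 500.0) / 11_000.0;     // ~0.86

            System.out.printf("before=%.2f  similar=%.2f  worse=%.2f%n",
                    before, similarQuality, worseQuality);
        }
    }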

Why not rely on a raw count of violations?
With a raw count of violations, each new release would show an increase in the number of violations regardless of the actual quality level. Any new rule can add violations, because it turns invisible quality issues into visible ones, even when the compliance ratio for the new rule is better than that of the rest of the pack.
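
Again with invented numbers: a new rule whose own compliance (0.96) is better than the existing average (0.90) still raises the raw violation count, while the overall compliance ratio improves:

    public class RawCountVersusRatio {
        public static void main(String[] args) {
            // Hypothetical figures. Existing rules: 1,000 violations over
            // 10,000 checks, i.e. a 0.90 compliance ratio.
            long oldViolations = 1_000, oldChecks = 10_000;

            // A newly introduced rule finds 200 violations over 5,000 checks,
            // i.e. a 0.96 compliance ratio, better than the existing rules.
            long newViolations = 200, newChecks = 5_000;

            long rawCount = oldViolations + newViolations;                     // 1,200: the raw count goes up
            double ratio = 1.0 - (double) rawCount / (oldChecks + newChecks);  // 0.92: the overall ratio improves

            System.out.printf("raw violations=%d  compliance ratio=%.2f%n", rawCount, ratio);
        }
    }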

From one context to another
This area is perhaps the most delicate, because the ability to compare relies on the assumption that, while human contextual input differs, it is set fairly.

To expect fairness may seem naïve, but a lot of the measurement process already relies on some amount of fairness. For example, what are the true boundaries of the application you're measuring? Nowadays, with cross-app principles, the definition is blurred and it could be easy to omit part of the application to hide some facts from management.

One could also define a target architecture that works in one's own interest, but that would mean entering into the system flawed configuration data that anyone can review.

One could likewise vet libraries so that security-related vulnerabilities go unreported, improving assessment results, but that too would mean entering flawed configuration data that anyone can review.

At the end of the day
Yes, there can be true differences in the way multiple applications are assessed.

Still, this is no reason to dismiss benchmarking altogether. What matters is not hiding the differences and their impact.

As Dr. Bill Curtis stated during the CISQ (http://www.it-cisq.org/) Seminar in Berlin on June 19th, when explaining the key factors for conducting a clever productivity analysis: always inspect the data, investigate outliers and extreme values, and always question the results.

In other words, always use your brain.


More Stories By Lev Lesokhin

Lev Lesokhin is responsible for CAST's market development, strategy, thought leadership and product marketing worldwide. He has a passion for making customers successful, building the ecosystem, and advancing the state of the art in business technology. Lev comes to CAST from SAP, where he was Director, Global SME Marketing. Prior to SAP, Lev was at the Corporate Executive Board as one of the leaders of the Applications Executive Council, where he worked with the heads of applications organizations at Fortune 1000 companies to identify best management practices.