Agile Zone is brought to you in partnership with:

I am a programmer and architect (the kind that writes code) with a focus on testing and open source; I maintain the PHPUnit_Selenium project. I believe programming is one of the hardest and most beautiful jobs in the world. Giorgio is a DZone MVB and is not an employee of DZone and has posted 638 posts at DZone. You can read more from them at their website. View Full User Profile

Temporal correlation in Git repositories

10.06.2011
| 5942 views |
  • submit to reddit

Michael Feathers presented his recurring idea of finding out which elements of design change together: his goal is to discover which classes or methods are really coupled by analyzing empirical data instead of static analysis. Since he didn't publish code, I'm trying to replicate the analysis with a language-agnostic script.

A different approach to discover dependencies

The data he mines for this information are commits in the source code repository; this approach is different from the static analysis one. In the latter scenario, the analysis consist in using reflection to annotate which class extends or implement another; two elements result coupled or not coupled just as a function of their declarations, without taking into consideration how frequently they are changed or when a change into one ripples into the other.

In my opinion, another problem with static analysis is that in very dynamic languages there are not even relationships that can be analyzed. In PHP we still have explicit interfaces, but in Ruby or JavaScript there are no such things.

How it works

The analysis of Michael Feathers was conducted at the method level. The basic idea is to look at the commits as a source of data about changes in your codebase, more than to the code itself.

The assumption, which should be correct in most workflows, is that a commit correlates in an atomic operation many classes and methods that constitute a new functionality or a refactoring. Therefore, each commit that you find where class A and B are modified together adds a score to the relationship between them.

In his example, he worked with classes and methods. I put together a quicker example based on files, since in our standard one class always corresponds to a single file.

If we are able to gather which files are always modified together, we can discover some hidden coupling on an objective basis. For example, I would be interested in code that changes together between layers, or between distant part of the codebase. Feathers himself writes about informative data on the coupling of domain objects.

My implementation counts

The basic item considered by my code is a couple of files (read classes), represted as two file names. The focus on file also makes this analysis language-agnostic.

Each commit indicates that the files A, B, C, D... were modified together. So for each commit I add a certain score to the couples A,B; A,C; A,D; B,C; B,D; C,D. I do the analysis over at least a hundred (200) commits for statistical significance. It takes long.

The score for each commit is normalized to the number of files in the commit:

  • with just two files, the pair gets an addition to the score of 0.5.
  • with N files, each possible pair gets an addition of 1/N. e.g. N = 100 results in each pair being added 0.01. Thus, large commits where you just modify a copyright notice or a coding standard throughout hundreds of files do not count much (should they?).

An example

I run the code over the master branch of Doctrine 2 on Github (24042863acbabdcd0fa1432135a9836467f3bce7 at the time of this writing). These were the results (limited to the first 10 pairs).

array(10) {
  ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=>
  float(31.318715901959)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=>
  float(29.233642885566)
  ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=>
  float(27.059301324524)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Query/Parser.php"]=>
  float(25.077708602949)
  ["lib/vendor/doctrine-dbal|lib/Doctrine/ORM/Query/SqlWalker.php"]=>
  float(22.883601306557)
  ["lib/Doctrine/ORM/UnitOfWork.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=>
  float(22.658756168829)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataFactory.php"]=>
  float(22.625098398742)
  ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/Query/Parser.php"]=>
  float(21.878131735017)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Persisters/BasicEntityPersister.php"]=>
  float(21.422968340511)
  ["tests/Doctrine/Tests/ORM/Query/SelectSqlGenerationTest.php|lib/Doctrine/ORM/UnitOfWork.php"]=>
  float(20.691940478915)
}

The 1st, 3rd, 8th and 10th instances are example of coupling between test and production code, which is to be expected. Unless the test is really unrelated from the production code, I won't say this points an issue. SelectSqlGenerationTest is a very important functional test which verifies that operations involving SELECT queries are executed correclyt: probably new test cases lead to production code being add.

If we delete the test-related pairs, we get:

  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php"]=>
  float(29.233642885566)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Query/Parser.php"]=>
  float(25.077708602949)
  ["lib/vendor/doctrine-dbal|lib/Doctrine/ORM/Query/SqlWalker.php"]=>
  float(22.883601306557)
  ["lib/Doctrine/ORM/UnitOfWork.php|lib/Doctrine/ORM/Query/SqlWalker.php"]=>
  float(22.658756168829)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Mapping/ClassMetadataFactory.php"]=>
  float(22.625098398742)
  ["lib/Doctrine/ORM/Query/SqlWalker.php|lib/Doctrine/ORM/Persisters/BasicEntityPersister.php"]=>
  float(21.422968340511)

Query/SqlWalker is a class which is always present in these relationships with other classes. Its responsibility is a big one: transforming a parsed tree of DQL (language for querying objects) into SQL queries.

In fact, a quick test on classes length returns:

[09:21:25][giorgio@Desmond:~/code/doctrine2]$ wc `find . -name '*.php'` | sort -n | tail -n 5
   1942    6826   65254 ./lib/Doctrine/ORM/Mapping/ClassMetadataInfo.php
   2062    6731   80818 ./lib/Doctrine/ORM/Query/SqlWalker.php
   2410    8308   95066 ./lib/Doctrine/ORM/UnitOfWork.php
   3004    7535  101628 ./lib/Doctrine/ORM/Query/Parser.php
  76445  223518 2587575 total  

Thus SqlWalker is one of the biggest classes in the codebase, and one that changes very often with others: if the goal is to get the best bang for my buck, it would be my starting point for refactoring. But now I have to go back to my company's code to find out where the pain points are. :)

I forgot: here's my quickly hacked up code for getting these statistics.

<?php
class IncidenceMatrix
{
    private $matrix = array();

    public function addHit($element, $otherElement, $score)
    {
        if ($element > $otherElement) {
            $first = $element;
            $second = $otherElement;
        } else {
            $first = $otherElement;
            $second = $element;
        }
        if ($first == '' or $second == '') {
            return;
        }
        $key = $first . '|' . $second;
        if (!isset($this->hash[$key])) {
            $this->hash[$key] = 0;
        }
        $this->hash[$key] += $score;
    }

    public function getTopHits($howMany)
    {
        $hash = $this->hash;
        arsort($hash);
        return array_slice($hash, 0, $howMany);
    }
}

$commitListCommand = 'git log --oneline | head -n 200';
exec($commitListCommand, $logOfCommits);
$incidenceMatrix = new IncidenceMatrix();
foreach ($logOfCommits as $commitLog) {
    list ($commit, ) = explode(' ', $commitLog);
    $filesListCommand = "git show --pretty='format:' --name-only $commit";
    exec($filesListCommand, $fileList);
    $commitScore = 1 / count($fileList);
    for ($i = 0; $i < count($fileList); $i++) {
       for ($j = $i + 1; $j < count($fileList); $j++) {
           $incidenceMatrix->addHit($fileList[$i], $fileList[$j], $commitScore);
       }
    }
}
$topHits = $incidenceMatrix->getTopHits(10);
var_dump($topHits);
Published at DZone with permission of Giorgio Sironi, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)