Counting lines of code

Saturday 19 January 2019

I wrote an Open edX blog post about the need to move from Python 2 to Python 3. For emphasis, I wanted to say how much code there was. Open edX is a large project spread across a number of repos. Why spend 30 minutes writing a blog post when you can first spend two hours fiddling around with line-counting tools to get a vague factoid for the blog post?

The old standard tool for line-counting is cloc. It has way too many options, many of which don’t work quite the way I would have expected, but it gets the job done, with some bash support. My resulting monster is below.

It over-counts JavaScript code because there are lots of places where JavaScript that we didn’t write gets checked into git. I don’t know what to do about that. Oh well.

BTW, on the subject of line counting: once, helping someone with a program, I saw they were using semicolons to end their Python statements. I said they didn’t need them, and they replied, “Yes I do, because my manager’s line-counting software requires them.” !!!

Be careful out there...

#!/bin/bash

# Count lines of code in a tree of git repos.
# Needs cloc (https://github.com/AlDanial/cloc)

REPORTDIR=/tmp/cloc-reports
mkdir -p "$REPORTDIR"
# ${REPORTDIR:?} aborts if the variable is ever empty, so this can't become "rm -rf /*".
rm -rf "${REPORTDIR:?}"/*

cat <<EOF > "$REPORTDIR/exclude-files.txt"
package-lock.json
EOF

cat <<EOF > "$REPORTDIR/more-langs.txt"
reStructured Text
    filter remove_matches xyzzy
    extension rst
    3rd_gen_scale 1.0
SVG Graphics
    filter remove_html_comments
    extension svg
    3rd_gen_scale 1.0
EOF

find . -name .git -type d -prune | while read -r d; do
    dd=$(dirname "$d")
    if [[ $dd == ./src/third-party/* ]]; then
        # Ignore repos in the "third-party" tree.
        continue
    fi
    echo "==== $dd =============================================="
    # Skip the repo if we can't enter it, so cloc never runs in the wrong directory.
    cd "$dd" || continue
    git remote -v

    # Name the report after the last component of the repo path.
    REPORTHEAD="$REPORTDIR/${dd##*/}"
    cloc \
        --report-file="$REPORTHEAD.txt" \
        --read-lang-def="$REPORTDIR/more-langs.txt" \
        --ignored="$REPORTHEAD.ignored" \
        --vcs=git \
        --not-match-d='.*\.egg-info' \
        --exclude-dir=node_modules,vendor,locale \
        --exclude-ext=png,jpg,gif,ttf,eot,woff,mo,xcf \
        --exclude-list-file="$REPORTDIR/exclude-files.txt" \
        .
    # Go back to the starting directory so the next relative path from find resolves.
    cd - || exit 1
done

# Combine the per-repo reports into one total.
cloc \
    --sum-reports \
    --read-lang-def="$REPORTDIR/more-langs.txt" \
    "$REPORTDIR"/*.txt
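
One thing to watch for in the script: each report is named after the last component of the repo path (the `${dd##*/}` expansion), so two repos with the same directory name in different parts of the tree would write to the same report file, and only one of them would make it into the total. Here is a minimal sketch of a check for that. The function names `report_names` and `duplicate_report_names` are hypothetical helpers of mine, not part of cloc, and the example paths are made up.

```shell
# Hypothetical helper: print the report name the script would derive
# for each repo directory passed as an argument.
report_names() {
    local dd
    for dd in "$@"; do
        # Same expansion the script uses: strip everything up to the last slash.
        printf '%s\n' "${dd##*/}"
    done
}

# Hypothetical helper: print each report name that occurs more than once.
duplicate_report_names() {
    report_names "$@" | sort | uniq -d
}

# Example with made-up paths: two repos named "docs" collide.
duplicate_report_names ./src/edx-platform ./src/docs ./tools/docs
# prints: docs
```

If this prints anything for your tree, one option would be to build the report name from more of the path, at the cost of uglier file names.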

Comments

Dirkjan Ochtman 8:58 PM on 19 Jan 2019

One of the more modern (faster) tools is tokei. Might be nice to try some time?

https://github.com/Aaronepower/tokei

Nick 11:51 PM on 19 Jan 2019

+1 to tokei

Ned Batchelder 1:49 PM on 20 Jan 2019

Just to add to the complexity: cloc and tokei differ in how they count: cloc ignores zero-length files, tokei doesn't. cloc counts docstrings as comments, tokei counts them as code.
