[Bash Challenge 7] Can You Solve This Bash Script Puzzle?

Welcome to the Bash Challenge #7 by Yes I Know IT & It’s FOSS. In this weekly challenge we will show you a terminal screen, and we will count on you to help us obtain the result we want. There can be many solutions, and being creative is the most amusing part of the challenge.

If you haven’t done it already, do take a look at the previous challenges.

You can also buy these challenges (along with unpublished ones) in book form and support us:

Bash It Out! Bash Script Puzzle Book by It's FOSS is Available Now!

Ready to play? Here is this week’s challenge.

The token counter

This week we’re going back to a more “programming-oriented” challenge. The description is a little abstract, so bear with me for a few minutes; I hope the explanation below will be clear enough:

Bash Script Challenge

I have a stream of tokens, either ‘RED’ or ‘BLUE’. If you want, you can consider that as a representation of an event stream, for example. I have no particular control over that stream. I just know it produces either one or the other token, unpredictably. And I know the stream is finite (i.e., at some point, there will be no more data to read).

For the sake of this challenge, I used a Bash function to produce that stream. You are not allowed to change that in any way.

 # You MUST NOT change that :
 stream() {
   TOKENS=( "RED" "BLUE" )
   for((i=0;i<100;++i)) ; do
     echo ${TOKENS[RANDOM%2]}
   done
 }
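
Just to get a feel for the data, you can peek at a few tokens like this (my own quick check, not part of the challenge; the output changes on every run since it is based on RANDOM):

stream | head -n 5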

My goal is to count both the number of RED and the number of BLUE tokens there were in the stream. By myself, I was only able to find a solution to count the number of RED tokens alone:

 # You MUST change that
 stream | \
     grep -F RED | wc -l > RED.CNT
 cat RED.CNT

Unfortunately, I couldn’t find any solution to count both RED and BLUE tokens. That’s why I need your help. Any idea?

We’re looking forward to reading your solutions in the comment section below!

A few details

To create this challenge, I used:

  • GNU Bash, version 4.4.5 (x86_64-pc-linux-gnu)
  • Debian 4.8.7-1 (amd64)
  • All commands are those shipped with a standard Debian distribution
  • No commands were aliased

The solution

How to reproduce

Here is the raw code we used to produce this challenge. If you run that in a terminal, you will be able to reproduce exactly the same result as displayed in the challenge illustration (assuming you are using the same software versions as me):

rm -rf ItsFOSS
mkdir -p ItsFOSS
cd ItsFOSS
clear
stream() {
  TOKENS=( "RED" "BLUE" )
  for((i=0;i<100;++i)) ; do
    echo ${TOKENS[RANDOM%2]}
  done
}
stream | \
    grep -F RED | wc -l > RED.CNT
cat RED.CNT

What was the problem?

The only difficulty here is that my initial attempt discards part of the input, because I send the data stream directly to grep: the BLUE lines are filtered out before anything can count them.
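
To make the issue concrete, here is the kind of workaround that does not satisfy the constraints (my illustration, not part of the original challenge): running the stream twice re-generates different random data, and a real event stream could not be replayed at all.

# NOT a valid solution: each invocation of stream produces a new,
# different sequence of tokens, and a real stream cannot be replayed
stream | grep -cF RED  > RED.CNT
stream | grep -cF BLUE > BLUE.CNT
cat RED.CNT BLUE.CNT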

Basically, there are three approaches to solve that problem:

  • Store the stream data and process them afterward;
  • Duplicate the stream and process two independent paths, one for the RED and one for the BLUE tokens;
  • Handle both cases in the same command as they arrive.

For what it’s worth, after each solution I give the real time observed on my system. This is just an indication and has to be taken with caution, so feel free to do the comparison yourself!

The store and process approach

The simplest implementation of the store-and-process approach is obvious:

stream > stream.cache
grep -F RED < stream.cache | wc -l > RED.CNT
grep -F BLUE < stream.cache | wc -l > BLUE.CNT
rm stream.cache
(1.3s for 10,000,000 tokens)

It works, but it has several drawbacks: you have to store the data, and the data are processed sequentially, one pass per token type. More subtly, since you read the stream.cache file twice, you potentially have a race condition if a concurrent process updates that file during processing.
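
If you want to stay with a cache file anyway, a small improvement (my own variant, not part of the original solutions) is to use a private temporary file created with mktemp rather than a fixed, predictable name:

# Same store-and-process idea, but with a private temporary file
CACHE=$(mktemp) || exit 1
stream > "$CACHE"
grep -cF RED  "$CACHE" > RED.CNT
grep -cF BLUE "$CACHE" > BLUE.CNT
rm -f "$CACHE"

This does not remove the need to store the data, but it avoids sharing a well-known file name with other processes.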

Still in the store-and-process category, here is a completely different solution:

stream | sort | uniq -c
(5.9s for 10,000,000 tokens)

I consider that a store-and-process approach, since the sort command has to read and store (either in RAM or on disk) all the data before being able to process them (the sort is needed because uniq -c only counts adjacent identical lines). More precisely, on my Debian system, the sort command creates several temporary files in /tmp with rw permissions. Basically, this solution has the same drawbacks as the very first one, but with much worse performance.
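
For reference, the result of that pipeline looks something like this (the counts are illustrative and vary on every run, since the stream is random):

stream | sort | uniq -c
#     47 BLUE
#     53 RED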

Duplicating the stream

Do we really have to store the data before processing them? No. A much more clever idea is to split the stream in two parts, processing one kind of token in each sub-stream:

stream | tee >(grep -F RED | wc -l > RED.CNT) \
             >(grep -F BLUE | wc -l > BLUE.CNT) \
             > /dev/null
(0.8s for 10,000,000 tokens)

Here, there are no intermediate files. The tee command replicates the stream data as they arrive. Each processing unit gets its own copy of the data and can process them on the fly.

This is a clever idea because not only do we handle the data as they arrive, but we also get parallel processing.
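
Once the pipeline has finished, the two files can be displayed just as in the first solution. One caveat of mine, not mentioned in the original challenge: the two counting branches run asynchronously inside the process substitutions, so in a script you may want to make sure they have completed before reading the files:

# Crude but simple synchronization before reading the results;
# a robust script would use a proper wait mechanism instead
sleep 1
cat RED.CNT BLUE.CNT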

Handle data as they arrive

In computer science, we would probably say the previous solution took a functional approach to the problem. On the other hand, the next ones are purely imperative solutions. Here, we read each token; if it is a RED token, we increment a RED counter, else if it is a BLUE token, we increment a BLUE counter.

This is a plain Bash implementation of that idea:

declare -i RED=0 BLUE=0
# Read from a process substitution rather than a pipe, so the loop
# runs in the current shell and the counters survive after the loop
while read TOKEN; do
    case "$TOKEN" in
        RED)  RED+=1 ;;
        BLUE) BLUE+=1 ;;
    esac
done < <(stream)
printf "%5d %5d\n" $RED $BLUE
(103.2s for 10,000,000 tokens)

Finally, being a great fan of the AWK command, I cannot resist the temptation to use it to solve that challenge in a neat and elegant way:

stream | awk '
  /RED/ { RED++ }
  /BLUE/ { BLUE++ }
  END { printf "%5d %5d\n",RED,BLUE }
'
(2.6s for 10,000,000 tokens)

My AWK program is made of three rules:

  • When encountering a line containing the word RED, increase (++) the RED counter
  • When encountering a line containing the word BLUE, increase the BLUE counter
  • At the END of the input, display both counters.

Of course, to fully understand that, you have to know that uninitialized AWK variables are assumed to be zero when used with mathematical operators.
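
You can check that behaviour with a one-liner like this (my quick illustration, not part of the challenge):

awk 'BEGIN { RED++; printf "%d %d\n", RED, BLUE }'    # prints "1 0"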

That works great, but it requires duplicating the same rule for each token. That’s not a big deal here, as we have only two different tokens, but it becomes more annoying if we have many of them. To solve that, we can rely on arrays:

stream | awk '
  { C[$0]++ }
  END { printf "%5d %5d\n",C["RED"],C["BLUE"] }
'
(2.0s for 10,000,000 tokens)

We only need two rules here, whatever the number of tokens:

  • Whatever token is read ($0), increase the corresponding array cell (here, either C["RED"] or C["BLUE"])
  • At the END of the input, display the content of the array for both the "RED" and the "BLUE" cells.

Please notice that "RED" and "BLUE" are now character strings (did you see the double quotes around them?). That’s not an issue for AWK, since it supports associative arrays. And just like plain variables, uninitialized cells of an AWK associative array are assumed to be zero for mathematical operators.
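
And if there really were many different token types, a natural generalization (mine, not from the original article) is to let the END rule iterate over the array instead of naming each cell explicitly:

stream | awk '
  { C[$0]++ }                          # one cell per distinct token
  END {
    for (token in C)                   # iteration order is unspecified
      printf "%5d %s\n", C[token], token
  }
'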

As I explained before, I chose to use AWK here, but Perl fans might have a different opinion on the subject. If you’re one of them, why not post your own solution in the comment section?

Anyway, we hope you enjoyed that challenge. And stay tuned for more fun!

Comments

  1. for token in $(stream); do
         [ "${token}" = "RED" ] && let red=red+1 || let blue=blue+1
     done

     echo "Red:${red} Blue:${blue}"

  2. # Bash does assoc. arrays…
     token_count() {
         typeset -A count
         while read tok; do
             (( count[${tok}]++ ))
         done
         for t in ${!count[*]}; do
             echo $t: ${count[$t]}
         done
     }

     stream | token_count
