Welcome to the Bash Challenge #8 by Yes I Know IT & It’s FOSS. In this weekly challenge, we will show you a terminal screen, and we will count on you to help us obtain the result we wanted. There can be many solutions, and being creative is the most amusing part of the challenge.
If you haven’t done it already, do take a look at previous challenges:
You can also buy these challenges (with unpublished challenges) in book form and support us:
Ready to play? So here is this week’s challenge.
How to add a header?
This week I work with several data files and one header file. I just want to insert the contents of the header file on top of each data file:
For the sake of the demonstration, I only displayed one file. But you may imagine I have many of them, too many to consider editing them manually.
Anyway, for some reason my solution didn’t work: not only have I lost the data, but my header also appears twice.
cat HEADER DATA01 | tee DATA01
# Month, Year, Est.Value
# Month, Year, Est.Value
As you can see, I really need your help here, both to explain to me what was going on and to help me solve that issue. I’m really looking forward to reading your solutions in the comment section below!
A few details
To create this challenge, I used:
- GNU Bash, version 4.4.5 (x86_64-pc-linux-gnu)
- Debian 4.8.7-1 (amd64)
- All commands are those shipped with a standard Debian distribution
- No commands were aliased
Solution
How to reproduce
Here is the raw code we used to produce this challenge. If you run that in a terminal, you will be able to reproduce exactly the same result as displayed in the challenge illustration (assuming you are using the same software version as me):
rm -rf ItsFOSS
mkdir -p ItsFOSS
cd ItsFOSS
cat > HEADER << EOT
# Month, Year, Est.Value
EOT
cat > DATA01 << EOT
Dec, 2015, 15000
Jan, 2016, 12540
Feb, 2016, 11970
EOT
clear
head HEADER DATA01
cat HEADER DATA01 | tee DATA01
What was the problem?
In a pipeline, all commands are launched in parallel. That means the cat command reading the DATA01 file and the tee command overwriting that same file are launched simultaneously.
This is really a race condition. On my system, tee had time to overwrite the destination file before cat had the opportunity to read it. That would also explain the duplicated header: by the time cat got to DATA01, the file contained only the header that tee had just written into it. To illustrate the race, we can delay the commands and see that the output is clearly dependent on the timing:
cat HEADER DATA01 | ( sleep 1; tee DATA01 )
# Month, Year, Est.Value
Dec, 2015, 15000
Jan, 2016, 12540
Feb, 2016, 11970
(sleep 1 ; cat HEADER DATA01 ) | tee DATA01
# Month, Year, Est.Value
I would have a similar issue (albeit deterministic this time) using the simpler:
cat HEADER DATA01 > DATA01
In that case, the shell always truncates the destination file before launching the cat command. So the content of the file is lost long before cat even has the opportunity to read it.
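The ordering described above can be observed safely with a command that reads all of its input before writing anything. The following is only a sketch under a few assumptions: it works in a scratch directory created by mktemp, and it swaps cat for wc -l so that nothing keeps reading a file that is being written at the same time.

```shell
#!/bin/sh
# Throwaway directory (an assumption for this demo) so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Dec, 2015, 15000\nJan, 2016, 12540\nFeb, 2016, 11970\n' > DATA01

# The shell opens (and truncates) DATA01 for the redirection first,
# and only then starts wc, which therefore finds an empty file.
wc -l DATA01 > DATA01

# DATA01 now holds wc's report about an empty file (a count of 0),
# not the three data lines.
cat DATA01
```

The three data lines are gone before wc reads a single byte, which is the same fate DATA01 meets with the cat redirection.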
How to fix that?
Obviously, no one would ever use the sleep hack in a real situation. But this is not an issue: as part of the standard POSIX tools, we have several commands at our disposal to insert the header on top of a file. Before that, let’s take a look at the most basic solution.
The KISS solution
cat HEADER DATA01 > DATA01.NEW
mv -f DATA01.NEW DATA01
Do I really need to comment on that? Well, while rudimentary, this solution has a nice feature: mv uses the rename system call, which is atomic in the sense that other processes referencing the DATA01 file will see either the old content or the new content, but never half-written content.
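Since the real use case involves many data files, the same idea can be wrapped in a loop. This is a minimal sketch, not a definitive script: the scratch directory, the second file DATA02 and the .NEW suffix handling are assumptions made for the demonstration.

```shell
#!/bin/sh
# Scratch directory and sample files are assumptions made for this demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '# Month, Year, Est.Value\n' > HEADER
printf 'Dec, 2015, 15000\n' > DATA01
printf 'Jan, 2016, 12540\n' > DATA02

for f in DATA*; do
    # Skip leftovers from an interrupted earlier run.
    case $f in *.NEW) continue ;; esac
    # Build the new content next to the original, then rename it into place.
    # mv only runs if cat succeeded, so a failed cat leaves the data untouched.
    cat HEADER "$f" > "$f.NEW" && mv -f "$f.NEW" "$f"
done

head DATA01 DATA02
```

Creating the temporary file in the same directory as the original keeps both on the same filesystem, which is what lets mv rely on a plain rename.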
A somewhat similar solution, which avoids creating a temporary file visible on the filesystem, first obtains a file descriptor to read from the original file before overwriting it:
exec 3<DATA01 # (1)
rm -f DATA01 # (2)
cat HEADER - <&3 >DATA01 # (3)
exec 3<&- # (4)
1. Open the file DATA01 for reading using file descriptor 3;
2. Unlink the original file (i.e., remove its directory entry, but not the data, as the file is still open);
3. Use cat to read the header first, followed by a stdin read from file descriptor 3, and write to a new DATA01 file;
4. Close file descriptor 3. This will effectively delete the old DATA01 content.
Please note this solution is no longer atomic in the sense used above. Anyway, kudos to Adithya Kiran Gangu for having proposed that solution!
Using sed
When I first encountered similar problems, my idea was to use sed. It is quite easy to insert a “header” after the first line using sed. But it’s more difficult to insert something before the first line. In fact, to achieve that, we will need a little bit of magic:
sed -i '1{
r HEADER
N
}' DATA01
To fully understand, you need to know the (r)ead command inserts the content of a file in the destination stream, but only once the current line processing has ended. That’s why I used the (N)ext command: it will end the line 1 processing early (i.e., before normal line output). So, when encountering that command, sed ends processing of line 1, which triggers output of the content of the HEADER file. But line 1 itself is not sent to the output. It is kept in the sed buffer.
Then sed reads the next line of input, appends it to the buffer, and, as we do not have any rule for line 2, processes it as usual by sending its buffer to the output (remember that at that stage, the buffer contains both line 1 and line 2).
This solution has a major drawback: it assumes there is a line 2. If the data file contains only one line, this will fail miserably.
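One possible way around that limitation, assuming GNU sed is available, is its e command: a GNU extension (not POSIX) that runs a shell command and emits its output ahead of the line being processed. The sketch below uses a scratch directory and a one-line DATA01, both made up for the demonstration.

```shell
#!/bin/sh
# Scratch directory and sample files are assumptions made for this demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '# Month, Year, Est.Value\n' > HEADER
printf 'Dec, 2015, 15000\n' > DATA01

# GNU sed only: on line 1, execute "cat HEADER" and send its output
# to the output stream ahead of the line being processed.
sed -i '1e cat HEADER' DATA01

cat DATA01
```

This variant does not need a second line, but it still leaves a completely empty data file untouched, since there is no line 1 to trigger the command.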
Using ed or ex
We have very few occasions to use ed or its cousin ex. Both are line-oriented editors. Their behavior is very similar to vi in the sense that you load a file into memory and send commands to the editor to modify that file. The only difference here is that we will script the commands instead of sending them interactively.
ed DATA01 << .
0r HEADER
wq
.
ex -s DATA01 << .
0r HEADER
wq
.
This works great, but we have to load the whole file into memory, which could be an issue for very large files.
As always, those are probably only a subset of all possible solutions. So don’t hesitate to use the comment section to share your own ideas.
And stay tuned for more fun!
for i in `ls data*`;do cat header $i > tmp && mv tmp $i;done
Yes, using a temporary file is probably the simplest solution to that issue. Congratulations on being pragmatic!
cat DATA01 | tee -a HEADER
Yes, but by doing so, you trash my HEADER file, and I can no longer use it for DATA02, DATA03, and so on. Or did I miss something?
My Bad
cp HEADER HEADER_DATA01 && cat DATA01 | tee -a HEADER_DATA01
The ed editor??? Yup, the old ed editor is still installed by default on most Unixes and can still be useful:
ed -s DATA01 << HEADER DATA01 <==
Dec, 2015, 15000
Jan, 2016, 12540
Feb, 2016, 11970
ed -s DATA01 <<< $'0r HEADER\nw'
Sorry, output got scrambled in the first post.
Nice idea using `ed`. As a matter of fact, this is usually my own fall-back choice when something would lead to an over complicated `sed` solution. Good job !
The problem with tee is that it empties the file immediately, before the cat operation. Using the sponge utility from moreutils solves this issue.
cat header data01 |sponge data01
We can also use temporary file descriptors to save the content and read it back in the fashion below: a pure Bash solution. It's not recommended to use stdin and stdout operations on the same file when you use a pipe operation.
$ exec 3<DATA01 && rm -f DATA01 && cat HEADER – DATA01 && exec 3<&-
1. create a temp file descriptor 3 for DATA01
2. remove the original file
3. use cat to read the header first followed by a stdin read from file descriptor 3 and write to DATA01
4. close the file descriptor 3
typo in the command corrected:
$ exec 3<DATA01 && rm -f DATA01 && cat HEADER - <&3 >DATA01 && exec 3<&-
Nice use of a temporary file descriptor. Congratulations for that idea !
Yes, `sponge` is designed to solve that kind of issue. I don’t have it at hand though. How would it perform on very large files ?