My Website: https://data36.com
Whole article: https://data36.com/data-coding-bash-best-practices/
Note: If you are new here, I suggest starting with these previous data coding and/or bash articles:
Data Coding 101 – Install Python, R, SQL and bash! - https://data36.com/data-coding-101-install-python-sql-r-bash/
Data Coding 101 – Intro to Bash ep#1 - https://data36.com/data-coding-101-introduction-bash/
Tab key: Whenever you start typing in the command line and hit the TAB key, it automatically completes what you have typed so far (a command or a file name, for example), so you can save some typing.
Up/down arrows: They bring back your previous commands. So if you misspelled something or want to make a small modification, you don’t have to type the whole command again.
history --» This bash tool prints your recently used commands on your terminal screen. (Pro tip: try it with grep! E.g. history | grep 'cut' will list all the commands you have used that contain cut.)
CTRL + L or clear: It clears your screen. Better for your mind, better for your eyes!
First install CSVKit!
sudo pip install csvkit
Note: I first heard about the CSVKit bash tools in this book: Jeroen Janssens - Data Science at the Command Line.
csvlook helps you to see your csv files in a much cleaner, much more human-readable format. E.g. here’s a short sample from our flightdelays.csv file:
cat 2007.csv |head |cut -d',' -f12,13,14,15| csvlook
csvstat gives you back some basic statistics about your dataset. Try:
cat demo1.csv | csvstat
(Even median is there! Remember last time how hard it was to get it?)
Note: unfortunately, csvstat is not so great with bigger files. For example, you can’t use it on the whole flightdelays.csv, because that file is way too big.
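A possible workaround, sketched here as an assumption rather than a csvstat feature, is to run csvstat on a small slice of the big file only. In the sketch below, csv_head is a made-up helper name and the default of 1000 rows is arbitrary. Keep in mind that the first rows of a file are not a random sample, so the statistics describe that slice only, not the whole dataset.

```shell
# csv_head is a hypothetical helper (not part of csvkit):
# it prints the header line plus the first N data rows of a CSV file.
csv_head() {
  local file="$1"
  local rows="${2:-1000}"   # the default of 1000 rows is an arbitrary assumption
  head -n "$((rows + 1))" "$file"
}

# Usage, assuming csvkit is installed and flightdelays.csv is in the current folder:
# csv_head flightdelays.csv 1000 | csvstat
```

The helper itself uses only head, so it stays fast even on very large files; csvstat then only has to process the rows it receives.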
This will sound silly, but I assure you this is a real problem and this is a real solution for it. This is what data scientists do when they use the command line in real life. The problem: when you cat a file to your screen, then another one right after, it’s really hard to find the first row of the second file. The main reason is that the prompt looks like every other line in your files. If you’ve watched the video above, you have seen that I colored my prompt. That’s part of the solution, but to make sure I find the first line of my second file, I usually hit Enter 10-15 times before the second cat.
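The same trick can be wrapped in a tiny function, so you don’t have to hit Enter by hand. This is only a sketch: spacer is a made-up name, its default of 15 lines mirrors the 10-15 mentioned above, and the PS1 line is one assumed way to color a bash prompt (bold green), not necessarily the setup used in the video.

```shell
# spacer is a hypothetical helper: print N blank lines (default: 15)
# to visually separate the output of two commands.
spacer() {
  local n="${1:-15}"
  local i=0
  while [ "$i" -lt "$n" ]; do
    echo
    i=$((i + 1))
  done
}

# Assumed example of a colored (bold green) bash prompt.
# Put this line in ~/.bashrc if you want to keep it between sessions.
PS1='\[\e[1;32m\]\u@\h:\w\$\[\e[0m\] '

# Usage (the file names are placeholders):
# cat first.csv; spacer; cat second.csv
```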
man --» this is a bash command to learn more about specific command line tools. E.g. try:
man cat --» and you will get into the manual of cat. It works for almost every command. (man cut, man grep, etc…) The good thing about it is that each manual contains a list of all the options for the given command.
Googling + StackOverflow
I know this is something I should not even mention, but still: if you have a question, chances are that someone has already asked it and someone else has already answered it somewhere on the internet. So just don’t forget to use Google first. By the way, most of the answers are on a website called StackOverflow. If your answer happens not to be there, StackOverflow is still awesome, because you can also ask questions there. There are many nice and smart people there, from whom you will get an answer fast, so don’t be shy! ;-)
A great book about Data Science at the Command Line
Jeroen Janssens - Data Science at the Command Line
As far as I know, this is the one and only book that writes about bash as a tool for data scientists. It comes with many great tips and bash best practices! It assumes that you have some initial Python and/or R knowledge, but if you don’t, I still recommend it. If you have read my Data Coding 101 articles about bash so far, you won’t have any trouble understanding most of this nice book!
Today we went through some great tools to make your job in bash cleaner, faster and smarter.
Next week I’ll show you two major control flow components of bash: the if command and the while loops. (And they are even more important, as we are gonna use the same logic later in Python and R as well!)
If you don’t want to miss any of my new data content (articles, videos, e-books, etc.), subscribe to my Newsletter: https://data36.com/newsletter-subscription/