Old-fashioned Way About Data Science
03 Nov 2019 - freefd
Share on:The usual Pythonic way to get a data series is sometimes not the fastest. In some cases, it’s enough to have only 3-6 Linux CLI tools to get a necessary data from Web and parse it in a proper way. Let’s look at such an approach on how to count the number of released RFC’s per year and draw the graph in a terminal.
Update History
- 2022-02-06: Improved escaping of special characters in awk delimiter as this didn’t work for the gawk implementation.
All you need is text
The important question at the start is always “Where can I get the data source?”. Fortunately, in 2019th you can find the organized list for all of RFCs 1.
At the first glance it seems that we need to parse HTML or XML data, but it’s not. Actually, we need to parse a simple text/plain data which should be in the same format for all entries. Here comes our first helper - w3m 2 CLI text-based browser.
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | head -50
RFC Index
...
avoided for brevity
...
See the RFC Editor Web page for more information.
RFC Index
Num Information
0001 Host Software S. Crocker [ April 1969 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0001)
0002 Host software B. Duvall [ April 1969 ] (TXT, PDF, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0002)
0003 Documentation conventions S.D. Crocker [ April 1969 ] (TXT, HTML) (Obsoleted-By RFC0010) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0003)
That’s how we find strictly-formed entries as follows:
NUMBER DESCRIPTION [ DATE ] OTHER_TECHNICAL_INFO
and we can extract the date information using awk 3 - a DSL designed for text processing. I prepared a test file from a small fragment of the collected data:
~> cat test-entries
0506 FTP command naming problem M.A. Padlipsky [ June 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0506)
0507 Not Issued
0508 Real-time data transmission on the ARPANET L. Pfeifer, J. McAfee [ May 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0508)
0509 Traffic statistics (April 1973) A.M. McKenzie [ April 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0509)
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries
June 1973
May 1973
April 1973
Please note that ^[0-9][0-9][0-9][0-9]
is used instead of ^[0-9]{4}
because mawk 4 doesn’t support 5 repetition.
Since now we able to extract human-readable dates, let’s convert them to numeric format with date 6 utility during implicit loop given from xargs 7:
~> # Years extraction
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%Y"
1973
1973
1973
~> # Months extraction
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m"
06
05
04
The next simple step is sorting (sort 8 tool) and counting unique (uniq 9 tool) values:
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%Y" | sort -n | uniq -c
3 1973
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c
1 04
1 05
1 06
We’re ready to put it all together over pipes:
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%Y" | sort -n | uniq -c
1 1968
25 1969
58 1970
182 1971
134 1972
162 1973
60 1974
24 1975
11 1976
20 1977
8 1978
7 1979
17 1980
29 1981
37 1982
49 1983
39 1984
41 1985
...
Drawing in terminal
Bare numbers aren’t very comfortable for analysis and thus we’ll use gnuplot 10 utility to draw graphs in the following configuration 11:
gnuplot -e "set term dumb size 145, 25; set xtics 3; plot '-' with lines notitle"
It’ll read the STDIN stream and draw a 145x25 graph right in the terminal. Putting it all together one more time:
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%Y" | sort -n | uniq -c | gnuplot -e "set term dumb size 145, 25; set xtics 3; plot '-' with lines notitle"
line 52: warning: Too many axis ticks requested (>2e+02)
2020 +-------------------------------------------------------------------------------------------------------------------------------------+
| **************************************** |
| **************** |
2010 |-+ ************************* +-|
| ************ |
| *********************************************|
| ***************************** |
2000 |-+ ************************** +-|
| **************************** |
| ***************** |
1990 |-+ ************************ +-|
| *** |
| ****** |
| ***** |
1980 |-******** +-|
| ***** |
| ****************************************** |
1970 |-+ ****************************************** +-|
|*********** |
| |
| |
1960 +-------------------------------------------------------------------------------------------------------------------------------------+
As you can see, something is going wrong. This is because gnuplot expects the first column as the X-axis and the second as the Y-axis. We need to swap our columns with each other:
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%Y" | sort -n | uniq -c | awk '{print $2" "$1}'
1968 1
1969 25
1970 58
1971 182
1972 134
1973 162
1974 60
1975 24
1976 11
1977 20
1978 8
1979 7
1980 17
1981 29
1982 37
1983 49
1984 39
1985 41
...
And rerun our graph plotting:
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%Y" | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 145, 25; set xtics 3; plot '-' with lines notitle"
500 +--------------------------------------------------------------------------------------------------------------------------------------+
| + + + + + + + + + + + + + + + + |
450 |-+ * +-|
| ** |
400 |-+ * * +-|
| * * ** |
350 |-+ * * ** * +-|
| * * * *** ** |
300 |-+ ** *** * ** ****** +-|
| ** * **** * |
250 |-+ *** * * * +-|
| ** * ** * |
| * * *** * |
200 |-+ *** ** ** ** *-|
| * ** ** * *** *|
150 |-+ * * * * ** +-|
| * * * |
100 |-+ * * **** +-|
| * *** ** |
50 |-+ ** ************ ********** +-|
| ** + + ** *** ***** + * + + + + + + + + + + |
0 +--------------------------------------------------------------------------------------------------------------------------------------+
1968 1971 1974 1977 1980 1983 1986 1989 1992 1995 1998 2001 2004 2007 2010 2013 2016 2019
Now it looks nice. For months statistics we see the expected deviations for July and November/December, the most productivity release dates are in March/April:
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%m" | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 145, 25; set xtics 1; set ytics 40; plot '-' notitle smooth csplines"
840 +--------------------------------------------------------------------------------------------------------------------------------------+
| + + + + + + + + + + |
| ********* |
800 |-+ *** ** +-|
| * *** |
760 |-+ * *** ******* **** * +-|
| ** *** *** *** ** *** *** *** |
| * ******* * * * * ** |
720 |** *** * ** * *** * +-|
| **** ** ** * *** *** * |
680 |-+ **** * * * * +-|
| * * ** |
| ** ** * |
640 |-+ * * * +-|
| *** * * |
| * * |
600 |-+ * +-|
| ** |
560 |-+ ** +-|
| ********|
| + + + + + + + + + + |
520 +--------------------------------------------------------------------------------------------------------------------------------------+
1 2 3 4 5 6 7 8 9 10 11 12
Moreover, we can plot the same graph for IETF 12 Internet-Drafts 13 to realize how rapidly their numbers are growing:
~> curl -s http://mirror.funkfreundelandshut.de/ietf/internet-drafts/all_id.txt | awk '/^draft/{print $2}' | xargs -I {} env TZ=Europe/London date -d'{}' +"%Y-%m" | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 145, 25; set xtics 2; set ytics 20; plot '-' notitle smooth csplines"
500 +--------------------------------------------------------------------------------------------------------------------------------------+
480 |-+ + + + + + + + + + + + + + + + +-|
460 |-+ +-|
440 |-+ +-|
400 |-+ +-|
380 |-+ +-|
360 |-+ +-|
340 |-+ +-|
320 |-+ +-|
280 |-+ +-|
260 |-+ +-|
240 |-+ +-|
220 |-+ * +-|
200 |-+ * +-|
160 |-+ * +-|
140 |-+ **** ************************* * +-|
120 |-+ *** ********************* ****** ***** * +-|
100 |-+ ****** ****** **** +-|
80 |-+ ******* +-|
40 |-+ ******* +-|
20 |-+ + +*************** + + + + + + + + + + + + +-|
0 +--------------------------------------------------------------------------------------------------------------------------------------+
1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020
References
1. RFC Editor ↩
2. Text-based web browser w3m ↩
3. “Aho, Weinberger and Kernighan” domain-specific language ↩
4. awk originally written by Mike Brennan ↩
5. Built-in regex’s do not support brace-expressions ↩
6. date - write the date and time ↩
7. xargs - construct argument lists and invoke utility ↩
8. sort - sort, merge, or sequence check text files ↩
9. uniq - report or filter out repeated lines ↩
10. gnuplot - portable command-line driven graphing utility ↩
11. gnuplot documentation ↩
12. Internet Engineering Task Force ↩
13. Internet-Drafts ↩