We are experts in statistical programming – we have been using R/Python/C/C++/JavaScript since 2003 (20+ years). During that time, we have been able to see the advantages and disadvantages of different programming approaches, and we would be excited to share our insights with you, during a half-day or full-day workshop, including seminars on one or more of the following topics.
Seminar topics
data.table
for data science
We will teach you how to speed up your data analysis scripts by
10-100x using data.table
, which is a
state-of-the-art in-memory database system, implemented as an R
package with efficient C code. We teach a wide range of topics,
including basics such as reading/writing CSV, as well as advanced
topics such as rolling joins for time series data. Example: 3 hour
tutorial with exercises presented at LatinR conference, Montevideo,
Uruguay, Oct
2023.
torch
for deep learning in R and python
Do you want to create applications that use artificial intelligence,
trained using machine learning algorithms on big data sets? (text,
image, tabular, time series, etc) We teach how to use
torch
, from basics like stochastic gradient descent learning with
DataLoaders, to advanced topics like application-specific loss
functions, such as the AUM loss which we proposed for learning with
imbalanced binary labels (for example 99% negative labels, 1% positive
labels, Python
code, R
code).
Example: Introduction to Deep Learning in R, a 2 hour lecture for
Research Bazaar Arizona, Flagstaff, April
2023.
Machine learning with mlr3
package
Do you want to be able to easily compare prediction accuracy of
different machine learning algorithms, to help you decide which one
you should deploy for your own data sets? We teach how to do
that using the mlr3
framework, which provides an easy-to-use
interface for many common and highly effective machine learning
algorithms (neural networks, linear models, random forests, boosting,
support vector machines, etc). Advanced seminars include
parallelization using mlr3batchmark
package. Example: The
importance of hyper-parameter tuning blog
post.
Advanced cross-validation for evaluating machine learning model predictions
Have you ever wondered if your machine learning model trained on this
year will still be accurate next year? Or if a model trained on data
from one country would be as accurate in another country? Or if your
machine learning algorithm would get better predictions if you had
more training data? We teach how to answer these questions with the
easily paralellizable implementation of cross-validation in our
mlr3resampling
R package. Examples: New code for various kinds of cross-validation
blog post,
Slides for talk about Same/Other/All K-fold cross-validation
(SOAK).
Regular expressions (regex) for text parsing and data reshaping
Does any of your data analysis involve scraping data from web pages,
or other text files like server logs? We teach how to use
regex for extracting tabular data from such data. Refactoring text
parsing code using regex can yield big speedups; using modern
packages like nc
, the code can be
made simple, readable, and maintainable! Examples: Regex tutorial with exercises
presented at McGill University, Montreal, Quebec,
2015, Collaborations not allowed blog post about web scraping.
Sequential and time series data analysis
If you work with time series data, you have probably wondered if there are better analysis algorithms than sliding windows. We teach about time series data forecasting and optimal change-point detection using dynamic programming. Example: 3 hour tutorial on optimal change-point detection algorithms presented at international useR 2017 conference in Brussels, Belgium.
Data visualization
Which machine learning algorithm is best for my data set? Which
geographical region is the biggest outlier? What is the state of our
finances this year, relative to previous years? All of these questions
can be convincingly answered, and communicated to others, using data
visualization, which is the art and science of creating graphics that
deliver insights based on your data. Our seminars focus on the
multi-layered grammar of graphics, available via ggplot2
(static
graphics in R), plotnine
(static graphics in python), and animint2
(our R package for animated interactive graphics rendered on web
pages). Examples: The animint2 Manual presented in a tutorial at the
international useR 2016
conference,
Visualizing prediction error blog
post.
R package development
Do you often peform similar analyses in multiple different R scripts? Do you have to update R scripts or functions that were written many weeks/months/years ago, and you wish there was documentation that can remind you about how they work? We teach seminars about organizing R code into packages, complete with functions, tests, documentation, development on GitHub, and easy installation from CRAN. Example: exercises from unsupervised learning class at Northern Arizona University, Fall 2023.
C/C++ code in R packages and python modules
Do you need to use specialized data structures, such as
<map>
container
(red-black
tree) in C++
Standard Template Library? We teach about how to interface R packages
and python modules with C/C++ code, which can sometimes be necessary
for optimal performance. Example: Youtube video tutorial series, Make
an R package with C++
code.
Reproducible analysis and report generation
Do you need to generate reports every day/week/month, based on
constantly updated data sources? We teach reproducible analysis using
rmarkdown
and quarto
, which can be used to generate reports in a
variety of output formats (HTML, PDF, Word docx, etc). Example of
daily report generated using rmarkdown
: new reverse dependency check
report generated every day to make sure data.table
is compatible with
packages that depend on
it.
Time and memory efficient R programming
Do you often have to wait for R code to compute results using large
data sets? Or have you run out of memory? We teach seminars about how
to refactor R code to minimize time/memory usage. Topics include code
profiling to determine what lines/functions would benefit from
optimization, and comparative benchmarking to compare time/memory
usage of different code versions (using our proposed
atime
package). Example: When to
NOT write low-level
code?
Massive CPU parallelism
Do you have lots of data sets, or machine learning
algorithms, or train/test splits in cross-validation, that can each be
computed independently? We teach about how to parallelize
Python
and R code,
using software such as SLURM and batchtools
. Example: data.table
reverse dependency checks involve 1500+ dependent packages, each
checked in parallel, reducing 2-3 weeks of wall time to ~10
hours.
Relative advantages of python and R
Are you an expert in python/pandas, who wonders what are its
advantages/disadvantages relative to R/data.table
, or other
frameworks such as tidyverse
, polars
, duckdb
, arrow
, etc? We
teach about the relative strengths/weaknesses of different
libraries/languages, including discussion about functionality, syntax,
and performance. Example: Comparing data.table
reshape to duckdb
and
polars
.
Database management systems
Do you have to store and analyze data that is too large to load into
memory? We teach about R/python interfaces to databases such as
PostgreSQL, as well as alternative storage formats such as CSV, HDF,
arrow
, duckdb
, etc.
Contact and cost
Please email toby.hocking@r-project.org with a description of the programming/teaching project you would like us to do for you, and we can schedule a free discovery call to discuss our proposed solution, with a quote for how much that would cost. Discounts are possible for academic and non-profit clients.
Two of our most popular packages are:
- Half day package, US$3000, for two 90-minute seminars:
- seminar 8:30-10:00.
- coffee break 10:00-10:30.
- seminar 10:30-noon.
- Full day package, US$6000, for four 90-minute seminars:
- seminar 8:30-10:00.
- coffee break 10:00-10:30.
- seminar 10:30-noon.
- lunch break noon-1:30.
- seminar 1:30-3:00.
- coffee break 3:00-3:30.
- seminar 3:30-5:00.
- Travel and hotel expenses would also be required for an in person seminar (not applicable if you prefer that we create a virtual webinar for your organization).