We are experts in statistical programming – we have been using R/Python/C/C++/JavaScript since 2003 (20+ years). During that time, we have seen the advantages and disadvantages of different programming approaches, and we would be excited to share our insights with you in a half-day or full-day workshop, including seminars on one or more of the following topics.
Seminar topics
data.table for data science
We will teach you how to speed up your data analysis scripts by 10-100x using data.table, which is a state-of-the-art in-memory database system, implemented as an R package with efficient C code.
Our seminars can include a wide range of topics, including basics such as reading/writing CSV, as well as advanced topics such as rolling joins for time series data.
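To give a flavor of what a rolling join does, here is a minimal sketch in plain Python. It is not data.table code: the function name `rolling_join` and the toy price/trade data are hypothetical, made up for illustration. For each query time, it looks up the most recent observation at or before that time in a sorted time series, which is the basic idea behind a backward rolling join.

```python
from bisect import bisect_right

def rolling_join(query_times, series_times, series_values):
    """For each query time, return the value of the last observation
    at or before that time (None if there is no such observation).
    Assumes series_times is sorted in increasing order."""
    result = []
    for t in query_times:
        # Number of observations with time <= t.
        i = bisect_right(series_times, t)
        result.append(series_values[i - 1] if i > 0 else None)
    return result

# Hypothetical toy data: prices observed at times 1, 5, 10.
price_times = [1, 5, 10]
prices = [100.0, 101.5, 99.0]
trade_times = [0, 1, 4, 5, 12]
matched = rolling_join(trade_times, price_times, prices)
print(matched)
```

The binary search makes each lookup logarithmic in the length of the series, which is one reason this kind of join can be fast on sorted data.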
Examples:
- 3 hour tutorial with exercises presented at LatinR conference, Montevideo, Uruguay, Oct 2023,
- video of 90 minute talk at RUG meeting, Madrid, Spain, Feb 2025.
torch for deep learning in R and Python
Do you want to create applications that use artificial intelligence, trained using machine learning algorithms on big data sets (text, image, tabular, time series, etc.)?
We teach how to use torch, from basics like stochastic gradient descent learning with DataLoaders, to advanced topics like application-specific loss functions, such as the AUM loss which we proposed for learning with imbalanced labels (for example binary classification with 99% negative labels and 1% positive labels).
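As a rough illustration of the stochastic gradient descent loop mentioned above, here is a minimal sketch in plain Python, without torch. The function name, step size, batch size, and toy data are all assumptions chosen for illustration. The shuffled mini-batches play the role that a DataLoader would play in torch code, and the loss is plain mean squared error, not the AUM loss.

```python
import random

def sgd_linear_regression(xs, ys, step_size=0.05, batch_size=4,
                          epochs=300, seed=0):
    """Fit y = w*x + b by mini-batch stochastic gradient descent
    on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batch order in each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch.
            grad_w = grad_b = 0.0
            for i in batch:
                residual = (w * xs[i] + b) - ys[i]
                grad_w += 2 * residual * xs[i] / len(batch)
                grad_b += 2 * residual / len(batch)
            # Take a step in the direction of the negative gradient.
            w -= step_size * grad_w
            b -= step_size * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(f"w={w:.3f} b={b:.3f}")
```

In torch, the gradient computation would be handled by automatic differentiation, so swapping in an application-specific loss mostly means changing the function that computes the loss value.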
Examples:
- Introduction to Deep Learning in R, a 2 hour lecture for Research Bazaar Arizona, Flagstaff, April 2023,
- Blog about custom AUM loss function for torch in Python,
- Blog about custom AUM loss function for torch in R.
Machine learning with mlr3 package
Do you want to be able to easily compare prediction accuracy of different machine learning algorithms, to help you decide which one you should deploy for your own data sets?
We teach how to do that using the mlr3 framework, which provides an easy-to-use interface for many common and highly effective machine learning algorithms (neural networks, linear models, random forests, boosting, support vector machines, etc).
Advanced seminars include parallelization using the mlr3batchmark and mlr3resampling packages.
Examples:
Advanced cross-validation for evaluating machine learning model predictions
Have you ever wondered if your machine learning model trained on this year's data will still be accurate next year?
Or if a model trained on data from one country would be as accurate in another country?
Or if your machine learning algorithm would get better predictions if you had more training data?
We teach how to answer these questions with the easily parallelizable implementation of cross-validation in our mlr3resampling R package.
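The questions above can be framed as comparing models trained on the same group as the test data, on other groups, or on all groups. The following plain Python sketch enumerates such train/test splits. It is an illustration under stated assumptions, not the mlr3resampling implementation: the function name, the round-robin fold assignment, and the rule that training rows never come from the held-out fold are all choices made here for simplicity.

```python
def same_other_all_splits(groups, n_folds=3):
    """Return a list of (test_group, fold, train_name, train_idx, test_idx)."""
    # Assign fold ids round-robin within each group (an assumption;
    # folds are usually assigned at random in practice).
    fold_of = []
    counts = {}
    for g in groups:
        fold_of.append(counts.get(g, 0) % n_folds)
        counts[g] = counts.get(g, 0) + 1
    rows = range(len(groups))
    splits = []
    for test_group in sorted(set(groups)):
        for fold in range(n_folds):
            test_idx = [i for i in rows
                        if groups[i] == test_group and fold_of[i] == fold]
            # Candidate train rows are never in the held-out fold.
            not_held_out = [i for i in rows if fold_of[i] != fold]
            train_sets = {
                "same": [i for i in not_held_out if groups[i] == test_group],
                "other": [i for i in not_held_out if groups[i] != test_group],
                "all": not_held_out,
            }
            for name, train_idx in train_sets.items():
                splits.append((test_group, fold, name, train_idx, test_idx))
    return splits

# Hypothetical data: six rows from each of two years.
groups = ["2023"] * 6 + ["2024"] * 6
splits = same_other_all_splits(groups, n_folds=3)
for test_group, fold, name, train_idx, test_idx in splits[:3]:
    print(test_group, fold, name, train_idx, test_idx)
```

Each split can be computed independently of the others, which is what makes this kind of experiment easy to run in parallel.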
Examples:
- New code for various kinds of cross-validation blog post,
- Slides for talk about Same/Other/All K-fold cross-validation (SOAK).
Regular expressions (regex) for text parsing and data reshaping
Does any of your data analysis involve scraping data from web pages, or other text files like server logs?
We teach how to use regex for extracting tabular data from such text.
Refactoring text parsing R code using regex can yield big speedups, since regex libraries use compiled code (C/C++) for text parsing.
Using modern packages like nc, the R code/syntax can be simple, readable, and maintainable!
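As a small illustration of extracting a table from text, the sketch below uses Python's standard `re` module (not the nc package) with named capture groups, so that each group becomes one column. The log lines and the column names are hypothetical, made up for this example.

```python
import re

# Hypothetical server log lines, made up for illustration.
log_text = """\
2024-01-15 10:32:01 GET /index.html 200
2024-01-15 10:32:07 POST /login 302
2024-01-16 08:00:45 GET /missing 404
"""

# Each named group becomes one column of the output table.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3})"
)

# One dict per matched line: a simple row-oriented table.
rows = [m.groupdict() for m in pattern.finditer(log_text)]
for row in rows:
    row["status"] = int(row["status"])  # convert text to a number
    print(row)
```

Writing the pattern as one named group per line keeps it readable, and adding a column only requires adding one more group.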
Examples:
- Regex tutorial with exercises presented at McGill University, Montreal, Quebec, 2015,
- Collaborations not allowed blog post about web scraping.
Sequential and time series data analysis
If you work with time series data, you have probably wondered if there are better analysis algorithms than sliding windows. We teach about time series data forecasting and optimal change-point detection using dynamic programming. Example:
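The dynamic programming idea can be sketched in plain Python as follows. This is an illustration under simple assumptions (squared error loss, a fixed penalty per change-point, quadratic-time search over segment starts), not code from the seminar materials; the function name and the toy signal are hypothetical.

```python
def optimal_changepoints(data, penalty):
    """Return the sorted indices where a new segment starts, minimizing
    total squared error plus `penalty` for each change-point."""
    n = len(data)
    # Cumulative sums give the squared error of any segment in O(1).
    cum, cum_sq = [0.0], [0.0]
    for x in data:
        cum.append(cum[-1] + x)
        cum_sq.append(cum_sq[-1] + x * x)

    def segment_cost(start, end):
        # Squared error of data[start:end] around its mean.
        total = cum[end] - cum[start]
        return (cum_sq[end] - cum_sq[start]) - total * total / (end - start)

    # best_cost[t] is the optimal penalized cost of segmenting data[:t].
    best_cost = [0.0] * (n + 1)
    last_start = [0] * (n + 1)
    best_cost[0] = -penalty  # so that the first segment is not penalized
    for end in range(1, n + 1):
        candidates = [
            (best_cost[start] + penalty + segment_cost(start, end), start)
            for start in range(end)
        ]
        best_cost[end], last_start[end] = min(candidates)

    # Backtrack from the end to recover the segment starts.
    changes = []
    end = n
    while end > 0:
        start = last_start[end]
        if start > 0:
            changes.append(start)
        end = start
    return sorted(changes)

# Hypothetical signal with three level segments.
signal = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.8, 5.0, 2.0, 2.1, 1.9]
detected = optimal_changepoints(signal, penalty=1.0)
print(detected)
```

Unlike a sliding window, this recursion considers every possible last segment, so the returned segmentation is optimal for the chosen loss and penalty. A larger penalty yields fewer change-points.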
Data visualization
Which machine learning algorithm is best for my data set?
Which geographical region is the biggest outlier?
What is the state of our finances this year, relative to previous years?
All of these questions can be convincingly answered, and communicated to others, using data visualization, which is the art and science of creating graphics that deliver insights based on your data.
Our seminars focus on the multi-layered grammar of graphics, available via ggplot2 (static graphics in R), plotnine (static graphics in Python), and animint2 (our R package for animated interactive graphics rendered on web pages).
Examples:
- The animint2 Manual (presented in a tutorial at the international useR 2016 conference),
- Visualizing prediction error blog post,
- Video at Toulouse-Datavis 2025.
R package development
Do you often perform similar analyses in multiple different R scripts? Do you have to update R scripts or functions that were written many weeks/months/years ago, and wish there were documentation to remind you how they work? We teach seminars about organizing R code into packages, complete with functions, tests, documentation, development on GitHub, and easy installation from CRAN. Example:
C/C++ code in R packages and Python modules
Do you need to use specialized data structures, such as the <map> container (typically implemented as a red-black tree) in the C++ Standard Template Library?
We teach how to interface R packages and Python modules with C/C++ code, which can sometimes be necessary for optimal performance.
Example:
Reproducible analysis and report generation
Do you need to generate reports every day/week/month, based on constantly updated data sources?
We teach reproducible analysis using rmarkdown and quarto, which can be used to generate reports in a variety of output formats (HTML, PDF, Word docx, etc).
Examples:
- Daily report created using rmarkdown: new reverse dependency check report generated every day to make sure data.table is compatible with packages that depend on it.
- “Textbook” web sites created using quarto: Animint2 manual in English and French.
Time and memory efficient R programming
Do you often have to wait for R code to compute results using large data sets?
Or have you run out of memory?
We teach seminars about how to refactor R code to minimize time/memory usage.
Topics include code profiling to determine which lines/functions would benefit from optimization, and comparative benchmarking to measure time/memory usage of different code versions (using our atime package).
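To illustrate comparative benchmarking, the sketch below times two versions of the same computation for several data sizes, using Python's standard `timeit` module. It does not use or reproduce the atime package, and it measures time only, not memory; the two functions and the data sizes are hypothetical examples.

```python
import timeit

def count_duplicates_slow(values):
    # Quadratic time: a list membership test scans the whole list.
    seen, dup = [], 0
    for v in values:
        if v in seen:
            dup += 1
        else:
            seen.append(v)
    return dup

def count_duplicates_fast(values):
    # Linear time: a set membership test is constant time on average.
    seen, dup = set(), 0
    for v in values:
        if v in seen:
            dup += 1
        else:
            seen.add(v)
    return dup

versions = [("slow", count_duplicates_slow), ("fast", count_duplicates_fast)]
timings = []
for n in [100, 400, 1600]:
    values = list(range(n)) * 2  # every value appears twice
    for name, fun in versions:
        # Best of three runs reduces noise from other processes.
        seconds = min(timeit.repeat(lambda: fun(values), number=1, repeat=3))
        timings.append((n, name, seconds))
        print(f"N={n:5d} {name}: {seconds:.6f} seconds")
```

Timing several data sizes, rather than one, shows how the gap between the two versions grows with N, which helps decide whether a refactoring is worth the effort.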
Examples:
Massive CPU parallelism
Do you have lots of data sets, or machine learning algorithms, or train/test splits in cross-validation, that can each be computed independently?
We teach about how to parallelize Python and R code, using software such as SLURM, batchtools, and MPI.
Examples:
- Cross-validation experiments with torch learners.
- data.table reverse dependency checks involve 1500+ dependent packages, each checked in parallel, reducing 2-3 weeks of wall time to ~10 hours.
Relative advantages of Python and R
Are you an expert in Python/pandas who wonders about its advantages and disadvantages relative to R/data.table, or other frameworks such as tidyverse, polars, duckdb, arrow, etc.?
We teach about the relative strengths/weaknesses of different libraries/languages, including detailed comparisons of syntax, functionality, and performance.
Examples:
- Comparing data.table reshape to duckdb and polars.
- Update about data reshaping and visualization in R and Python.
- Benchmarking data.table with polars, duckdb, and pandas.
Database management systems
Do you have to store and analyze data that is too large to load into memory?
We teach about R/Python interfaces to databases such as PostgreSQL, as well as alternative storage formats such as CSV, HDF, arrow, duckdb, etc.
Contact and cost
Please email toby.hocking@r-project.org with a description of the programming/teaching project you would like us to do for you, and we can schedule a free discovery call to discuss our proposed solution, with a quote for how much that would cost. Discounts are possible for academic and non-profit clients.
Two of our most popular packages are:
- Half day package, US$3000, for two 90-minute seminars:
  - seminar 8:30-10:00.
  - coffee break 10:00-10:30.
  - seminar 10:30-noon.
- Full day package, US$6000, for four 90-minute seminars:
  - seminar 8:30-10:00.
  - coffee break 10:00-10:30.
  - seminar 10:30-noon.
  - lunch break noon-1:30.
  - seminar 1:30-3:00.
  - coffee break 3:00-3:30.
  - seminar 3:30-5:00.
- Travel and hotel expenses would also be required for an in person seminar (not applicable if you prefer that we create a virtual webinar for your organization).