Workshop Resources

Introduction

The following is an overview and list of resources for the programming languages that will be demonstrated, to varying degrees, during the workshop. It points you to where to install the proper tools for your situation and provides links to help you learn how to use them beyond the workshop.

Python

Python is an extremely popular general-purpose programming language with a long history in software development. More recently, tools have been developed for data science, statistics, machine learning, data visualization and more.

Basics

The recommended approach to using Python for data processing and analysis is either to install Python along with an IDE (Integrated Development Environment; see note 1) of your choice, or to install Anaconda, which provides Python, an IDE (Spyder), a default set of modules for data science, and more. Anaconda can be seen as a data science-specific distribution of Python.

Modules

Some of the more common modules used for data science in Python are:

  • numpy: basic numeric, matrix, and other operations
  • scipy: mathematics, science, and engineering
  • pandas: data frames and many other nice data processing features
  • matplotlib: basic visualization
  • statsmodels: statistical modeling
  • scikit-learn: machine learning

In base Python you can use the pip module to install additional modules. For example:


pip install pandas

If you are using Anaconda, you can type the following at the terminal/command line:


conda install pandas
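After either install command, one way to confirm that Python can find a module is to ask the import system directly. The sketch below is only an illustration: the helper name `is_installed` is hypothetical, and the list of names checked is an assumption based on the modules mentioned above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Check a few of the data science modules listed earlier
for name in ["pandas", "numpy", "matplotlib"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Note that the name you import is not always the name you install: scikit-learn, for example, is imported as `sklearn`.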

To use a module, you will import it in your script as follows:


import pandas

Often with an abbreviation:


import pandas as pd
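As a minimal sketch of what the abbreviated import looks like in use, the following builds a small data frame and filters it. It assumes pandas is already installed; the names and scores are made up purely for illustration.

```python
import pandas as pd

# Hypothetical example data: scores for three made-up participants
scores = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chen"],
        "score": [82, 91, 77],
    }
)

# Mean of the score column
mean_score = scores["score"].mean()

# Keep only the rows whose score is above the mean
above_mean = scores[scores["score"] > mean_score]

print(mean_score)
print(above_mean)
```

Everything pandas provides is reached through the `pd` prefix, which is why the abbreviation is so common in practice.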

Books

Online courses

  • DataCamp
  • MOOCs include Coursera, edX, and others, with courses on statistics and data science, programming, and more. The University of Michigan offers a couple of Python courses in particular.

Help

R

R is an open-source programming language specifically designed for data processing, statistical computing, and visualization. More recently it has evolved toward more general-purpose use, making it one of the most popular programming languages, even when compared with general-purpose languages.

Basics

To get started with R, you will need to install it, but you will typically interact with it only through RStudio. RStudio is an IDE for R that makes programming and many other tasks much easier.

Packages

RStudio is also an organization that develops some of the most widely used packages and contributes to many others. If you use their packages, you are using the same tools for data processing and visualization that much of the R world uses. To that end, if you install the tidyverse package, then together with the packages that come with RStudio and base R you will have a collection that will take you very far, from data import to publication.

For analysis, the base installation of R is already a very powerful statistical tool, but many packages will take you much further, even if you are doing something base R already does. You can look at CRAN Task Views to get an organized list of packages that are useful for specific types of models, or other tasks such as natural language processing, web scraping, etc.

Others:

  • caret: machine learning (eventually, tidymodels)

To install an R package, you would type something like the following in the R console or your R script:


install.packages('tidyverse')

To use a package:


library(tidyverse)

Books

Because R makes it so easy to publish documents in HTML, PDF, and other formats (see note 2), many people write whole textbooks on R programming and make them freely available. Some to be aware of:

But there are many, many more…

Online Courses etc.

  • DataCamp
  • MOOCs include Coursera, edX, and others, with courses on statistics and data science, programming, and more.
  • R bloggers

Help

SQL

Structured Query Language (‘sequel’ or ‘SQL’) is a language used for databases, usually relational databases. There are different relational database management systems (RDBMS), such as MySQL, PostgreSQL, Microsoft SQL Server (MSSQL), and Oracle, each with a correspondingly different flavor of SQL.

Basics

Even though SQL is an ANSI/ISO standard, there are different flavors of SQL on the market corresponding to the major RDBMS programs. Therefore, we recommend DBeaver, a cross-platform database tool that supports the major RDBMS programs and runs on Windows, Linux, and macOS.

Install the DBeaver Community Edition (https://dbeaver.io/) on your local machine or server. DBeaver needs the Java Runtime Environment (JRE) in order to run. The Windows and macOS installers already include the JRE. For Linux, however, you may need to install the JRE manually (https://github.com/dbeaver/dbeaver).

Books

The multi-platform nature of SQL invariably means that most books and instructional materials are geared towards a specific platform. For reference purposes you may want to purchase a book that is specific to the RDBMS your institution uses. Below is a list of introductory books that are multi-platform in their orientation.

Online Courses etc.

Help

Depending on the tools you use, you may not have to use SQL directly, and can stay within your analytical programming language of choice while still interacting with an SQL database.
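As a rough illustration of staying within your analytical language, the sketch below uses Python's built-in `sqlite3` module to create a throwaway in-memory database and query it. SQLite is used here only because it needs no server; the table name `participants` and its contents are hypothetical, and connecting to MySQL, PostgreSQL, or another RDBMS would require a different driver.

```python
import sqlite3

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only
conn.execute("CREATE TABLE participants (name TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO participants (name, language) VALUES (?, ?)",
    [("Ana", "Python"), ("Ben", "R"), ("Chen", "Python")],
)
conn.commit()

# Count participants per language; the SQL is sent as a plain string
rows = conn.execute(
    "SELECT language, COUNT(*) FROM participants "
    "GROUP BY language ORDER BY language"
).fetchall()
conn.close()

print(rows)
```

The query results come back as ordinary Python tuples, so they can be passed on to whatever analysis tools you already use.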


  1. IDEs are what you actually do your programming with. They make programming much easier through features like syntax completion and auto-indent, and often provide other tools, e.g. debugging and visualization. You would typically not do your programming with the base R (console) or Python (IDLE) tool.
  2. The document you’re reading was created via R Markdown in RStudio.