The following is an overview of, and a list of resources for, the programming languages that will be demonstrated to varying degrees during the workshop. It shows where to go to install the proper tools for your situation, and provides links to help you keep learning beyond the workshop.
Python is an extremely popular general-purpose programming language with a long history in software development. More recently, tools have been developed for data science, statistics, machine learning, data visualization and more.
The recommended approach to using Python for data processing and analysis is either to install Python and an IDE (Integrated Development Environment[1]) of your choice for the actual programming, or simply to use Anaconda, which provides Python, an IDE (Spyder), default modules for jumping into data science, and more. Anaconda can be seen as a data science-specific distribution of Python.
- Python 3: The basic Python installation
- Anaconda: Even if you don’t use Anaconda for your day-to-day work, it might be useful to install it for the automatic installation of commonly used modules, Jupyter Notebook, and more.
- IDEs: Jupyter Notebook, PyCharm, Anaconda Spyder, Atom, many others
Some of the more common modules used for data science in Python are:
- numpy: basic numeric, matrix, and other operations
- scipy: mathematics, science, and engineering
- pandas: data frames and many other nice data processing features
- matplotlib: basic visualization
- statsmodels: statistical modeling
- scikit-learn: machine learning
In base Python you can use the pip module at the terminal/command line to install additional modules. For example:
pip install pandas
If you are using Anaconda, you can instead type the following at the terminal/command line:
conda install pandas
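Before installing anything, it can help to see which of these modules are already available. The following is a minimal sketch using only the standard library; the helper name installed_modules is hypothetical, and the list uses import names, which can differ from install names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names of the modules listed above. Note that scikit-learn is
# installed as "scikit-learn" but imported as "sklearn".
MODULES = ["numpy", "scipy", "pandas", "matplotlib", "statsmodels", "sklearn"]


def installed_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    # find_spec returns None when a top-level module is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in installed_modules(MODULES).items():
        status = "installed" if available else "missing"
        print(f"{name}: {status}")
```

Any module reported as missing can then be installed with pip or conda as shown above.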
To use a module, you import it in your script by name, for example with import pandas.
Often with an abbreviation:
import pandas as pd
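As a rough illustration of what the abbreviation does, the sketch below imports the same module under its full name and under an alias. It uses the standard library's statistics module as a stand-in for pandas so that it runs without installing anything; the alias st and the sample values are arbitrary choices for this example.

```python
# The standard-library statistics module stands in for a third-party
# package here so the example runs without installing anything.
import statistics
import statistics as st

values = [2, 4, 4, 5]

# Both names refer to the same module object, so the calls are equivalent.
full_name_mean = statistics.mean(values)
alias_mean = st.mean(values)

print(full_name_mean, alias_mean)
print(st is statistics)
```

The alias only changes the name you type; it does not create a second copy of the module.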
- MOOCs include Coursera, edX, and others with courses on statistics and data science, programming, and more. The University of Michigan offers a couple of Python courses in particular.
R is an open-source programming language specifically designed for data processing, statistical computing, and visualization. More recently it has evolved toward more general-purpose usage, making it one of the most popular programming languages overall, not just among statistical languages.
To get started with R, you will need to install it, but you will typically use RStudio to actually interact with it. RStudio is an IDE for R that makes programming and many other things much easier.
RStudio is also an organization; it develops some of the most widely used packages and contributes to many others. If you are using its packages, you are using the same tools for data processing and visualization as much of the R world. To that end, if you install the tidyverse package, along with those that come with RStudio and base R, you will have a collection of packages that will take you very far, from data import to publication.
For analysis, the base installation of R is already a very powerful statistical tool, but many packages will take you much further, even if you are doing something base R already does. You can look at CRAN Task Views to get an organized list of packages that are useful for specific types of models, or other tasks such as natural language processing, web scraping, etc.
- caret: machine learning (eventually, tidymodels)
To install an R package, you would type something like install.packages("tidyverse") in your R console or script.
To use a package, load it with library(), for example library(tidyverse).
Because R makes it so easy to publish documents in HTML, PDF, etc.[2], many people even write whole textbooks for R programming and make them freely available. Some to be aware of:
But there are many, many more…
Online Courses etc.
- MOOCs include Coursera, edX, and others with courses on statistics and data science, programming, and more.
- R bloggers
Structured Query Language (‘sequel’ or ‘SQL’) is a language used for databases, usually relational databases. There are different relational database management systems (RDBMS), such as MySQL, PostgreSQL, Microsoft SQL Server [MSSQL], Oracle, etc., each with its own flavor of SQL.
Even though SQL is an ANSI/ISO standard, there are different flavors of SQL in the market corresponding to the major RDBMS products. Therefore, we recommend DBeaver, a cross-platform database tool that supports the major RDBMS products. DBeaver runs on Windows, Linux, and macOS machines.
Install the DBeaver Community Edition (https://dbeaver.io/) on your local machine or server. DBeaver needs the Java Runtime Environment (JRE) in order to run. The Windows and macOS installers already include the JRE. For Linux, however, you may need to install the JRE manually (https://github.com/dbeaver/dbeaver).
The multi-platform nature of SQL invariably means that most books and instructional materials are geared towards a specific platform. For reference purposes you may want to purchase a book that is specific to the RDBMS your institution uses. Below is a list of introductory books that are multi-platform in their orientation.
- SQL Queries for Mere Mortals
- Sams Teach Yourself SQL in 10 Minutes
- Head First SQL [mostly Standard SQL with a nod towards MySQL]
Online Courses etc.
- About DBeaver
- 10 Easy Steps to a Complete Understanding of SQL
- SQL Tutorial
- A Beginner’s Guide to SQL
Depending on the tools you use, you may not have to use SQL directly, and can stay within your analytical programming language of choice while still interacting with an SQL database.
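As a minimal sketch of staying inside an analytical language while still querying an SQL database, the example below uses Python's built-in sqlite3 module with an in-memory SQLite database. The table, column names, data, and the helper mean_score_by_course are all hypothetical and chosen only for illustration; with another RDBMS you would use that system's own driver, and packages such as pandas can wrap this step further.

```python
import sqlite3

# Hypothetical example data; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, course TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("ana", "python", 90.0), ("ben", "python", 80.0), ("ana", "sql", 70.0)],
)


def mean_score_by_course(connection):
    """Run one SQL aggregation and return the result as a Python dict."""
    rows = connection.execute(
        "SELECT course, AVG(score) FROM scores GROUP BY course ORDER BY course"
    ).fetchall()
    # Each row is a (course, average) tuple, so dict() maps course -> average.
    return dict(rows)


means = mean_score_by_course(conn)
print(means)
conn.close()
```

The SQL itself lives in a single string, so the rest of the analysis can continue with ordinary Python objects.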
1. IDEs are what you actually do your programming with. They make programming much easier with features like syntax completion and auto-indent, and often provide other tools, e.g. debugging and visualization. You would typically never use the base R (console) or Python (IDLE) tool to do your programming.
2. The document you’re reading was created via R Markdown in RStudio.