# 16 Coding Resources

Learning to code for data wrangling and analysis will be a significant component of your education during the MS or PhD program. Some people will carry these skills forward and keep using them throughout their careers. Even if you don’t use them extensively yourself, you will likely oversee people who do. It’s therefore good to know the general principles of coding, how coding problems arise (some are difficult to diagnose), and how to overcome them, even if you’re not the one directly coding your analyses for the rest of your career.

What will you learn in our program? Primarily SAS and R, but you’re welcome to explore other platforms as well during your time here.

Other platforms, such as SPSS and Python, see some use in our department but are not widespread, though that may change (particularly for Python) with the new AI initiatives at UF.

## 16.1 SAS

As noted above, SAS is the primary platform used in our program. We use SAS fairly extensively in our lab’s work as well, in part because it’s particularly good at working with massive datasets. SAS is available on all of our Virtual Machines/servers, and UF offers discounted annual licenses for SAS (as well as a free cloud-based SAS through UFApps) for enrolled students. More info on individual licenses can be found here for students and here for faculty/postdocs/staff. Note that staff/postdocs should get their license through the Department by contacting Carl Henriksen.

Some folks like working in base SAS by itself. Others prefer SAS Enterprise Guide, which wraps around base SAS and provides some additional functionality. Try each, and see what you prefer.

One downside to SAS is that it does not run natively on macOS, so if you have a Mac and want SAS on your own system, you’ll need Parallels, VMware, or similar virtualization software to run a Windows virtual machine.

### 16.1.1 Books

There are lots of good SAS books out there, but here are a few you might find particularly useful. (* denotes texts Dr. Smith has electronic copies of and that can be ‘checked out’ within the lab.)

The Little SAS Book, 6th ed. (you can access this one from campus or on the VPN - Dr. Smith also has a copy of the 5th ed.*) (Delwiche and Slaughter 2019)

Analysis of Observational Health Care Data using SAS* (Faries and Institute 2010)

Survival Analysis Using SAS: A Practical Guide* (Allison 2010)

### 16.1.2 Useful online articles/links/blogs

SAS Procedures by Name - This is a must-have bookmark to the official SAS documentation; you will use it often and it’s quite helpful.

UCLA Office of Advanced Research Computing tutorials - a good starting place for basics of running relatively simple analyses/data wrangling and interpreting.

UF PHC 6052 course tutorials - you’ll take this class relatively early in the program, but still a useful resource

The DO loop - excellent and very productive blog by Rick Wicklin

LexJansen.com - not a particularly user-friendly site, but contains tons of SAS-related papers. Your best bet is just googling your problem, but there’s a good chance the top hits will be papers in PDF form on this site.

### 16.1.3 Macros

Squeeze - shrinks datasets by reducing variable lengths to the minimum the data actually require

[Magic Macro] - to add

OptionReset - reset default options, if you’ve somehow mangled yours

ms_freezedata - a mini-SENTINEL program macro that creates subsets of patient-level data from a supplied patient id list

[Table 1] - to add

## 16.2 R

### 16.2.1 Base R

You can download base R for free from CRAN here. Make sure you select the correct file for your operating system (and chip, if using a Mac).

Installation should be straightforward, and you can use the defaults. If needed, there are comprehensive instructions (HTML and PDF) available on CRAN.

The R development team has some good, if somewhat dense, manuals available on CRAN here. A good place to start is the Intro to R (HTML and PDF).
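Once R is installed, a quick check from the R console can confirm which version and build you ended up with. This is a minimal sketch using only base R; what it prints depends on your machine, so no expected output is shown.

```r
# Report the installed R version (confirms the install worked).
r_version <- R.version.string
print(r_version)

# Report the CPU architecture this R build targets; on a Mac this helps
# confirm you picked the build matching your chip (Apple silicon vs. Intel).
r_arch <- R.version$arch
print(r_arch)

# Show the package repository that install.packages() is set up to use.
print(getOption("repos"))
```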

### 16.2.2 RStudio and extensions

We recommend you use RStudio on top of R. Select the free desktop version here (*note: skip to step 2 since you will hopefully have already installed base R*). Again, make sure you download the correct file for your operating system. The website often assumes you run Windows, so if you don’t, scroll down a bit further to find the right file for your OS.

Again, installation should be straightforward, and you can use the defaults.

There are some useful extensions for RStudio, but none are absolutely necessary:

- Quarto - useful for generating all sorts of R Markdown-style documents, including papers (yes, you could actually write your manuscript in Quarto), technical reports, websites, and books (this lab manual was written with Quarto!); the beauty of this approach is that you can weave plain text and R code seamlessly into an output document (.html, .docx, .pdf, etc.), which means your analysis code can regenerate everything automatically any time you make a slight change to the cohort or underlying data.
- One ‘to-do’ for the lab might be to make a manuscript template which would make writing routine parts of manuscripts considerably easier 🤔

### 16.2.3 R packages

#### Data wrangling

The following packages are particularly useful for dealing with raw data as well as basic analyses:

tidyverse - a suite of packages that make R considerably easier to learn for the new user (in our opinion); bonus points because they’re supported by Posit (makers of RStudio) and are constantly being improved, unlike some packages which eventually languish. We suggest installing the entire tidyverse with

`install.packages("tidyverse")`

but you can also install components of the tidyverse individually.

data.table - an alternative to working with “tibbles” (dataframes in tidyverse), data.table offers a high-performance (fast) version of base R’s data.frame with syntax and feature enhancements; some people prefer `data.table` to tidyverse/tibbles, and once you learn the syntax (which can be a little awkward), it is a pretty useful package, particularly when working with larger datasets like the ones we use.

labelled - If you’ve gotten enough experience with R and SAS (or SPSS), you might notice one distinct difference: SAS and SPSS allow for labeling variables, while R does not (at least not out of the box). This package provides that functionality in R. You know what your variables are, but for the rest of us who don’t, labeling helps a lot.

janitor - has some useful data cleaning functions; nothing super complicated, but it will save you some time, particularly when bringing in messier data

More to come…
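To give a feel for how the wrangling packages above differ, here is a small sketch that computes the same per-patient total three ways (base R, dplyr from the tidyverse, and data.table) and then attaches a variable label with labelled. The `claims` data, its column names, and the label text are all made up for illustration. Each package block only runs if that package is installed, and the native pipe `|>` assumes R 4.1 or later.

```r
# Toy data; every name and value here is hypothetical.
claims <- data.frame(
  pt_id = c(1, 1, 2, 3, 3, 3),
  cost  = c(100, 250, 80, 40, 60, 20)
)

# Base R: total cost per patient (350, 80, and 120 for patients 1 to 3).
base_totals <- aggregate(cost ~ pt_id, data = claims, FUN = sum)
print(base_totals)

# tidyverse (dplyr): the same summary, returned as a tibble.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_totals <- claims |>
    dplyr::group_by(pt_id) |>
    dplyr::summarise(cost = sum(cost))
  print(dplyr_totals)
}

# data.table: the same summary using the dt[i, j, by] syntax.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(claims)
  dt_totals <- dt[, .(cost = sum(cost)), by = pt_id]
  print(dt_totals)
}

# labelled: attach a human-readable label to a column, then read it back.
if (requireNamespace("labelled", quietly = TRUE)) {
  labelled::var_label(claims$cost) <- "Paid amount (USD)"
  print(labelled::var_label(claims$cost))
}
```

Which syntax you prefer is largely a matter of taste; all three give the same totals.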

#### Graphics & Tables

ggplot2 - comes with tidyverse, so does not need separate installation, but will be your go-to for plotting much of what you’ll want.

plotly - interactive graphics

gt - short for “grammar of tables” - a tidyverse-syntaxed tables package that has some very nice extensions, see below; note that there are lots of other output table-oriented packages (the gt page lists many of these) and you may find some of them more to your liking.

gtExtras - extension for {gt} that adds some nifty functionality, including small plots in table rows

gtsummary - extension for {gt} for creating summary tables. See Lab Docs on the CVmedLab website for an example using gtsummary.
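As a rough sketch of what these packages look like in use, the example below draws a scatterplot with ggplot2 and builds a small summary table with gtsummary. It uses R’s built-in `mtcars` dataset purely for illustration (it has nothing to do with our lab data), and each block only runs if the package is installed.

```r
# ggplot2: scatterplot of fuel efficiency against weight, coloured by cylinders.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
    theme_minimal()
  print(p)
}

# gtsummary: descriptive summary of two variables, stratified by cylinders.
if (requireNamespace("gtsummary", quietly = TRUE)) {
  tbl <- gtsummary::tbl_summary(mtcars[, c("mpg", "wt", "cyl")], by = cyl)
  print(tbl)
}
```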

#### Interactive data presentation

- shiny - build interactive web apps and dashboards from R

#### Package development

Packages come in all shapes and sizes and don’t have to be a set of fancy functions to be used by the R community. A common use of R packages is collating everything needed for an analysis/paper (data, notes, analytic code, +/- the paper itself). If you’re going to create packages, the following are extremely helpful:

#### Analysis

Tons of packages in this space, but if you want to stick with the tidyverse, tidymodels is a good choice.

For regression modeling, Frank Harrell’s rms package is good and well supported with a website.

#### Color palettes

paletteer - most of the R color palettes floating around in space assembled in one package

#### Other Odds-and-Ends Packages

### 16.2.4 Useful R resources

r4ds - R for Data Science, by Hadley Wickham, is an excellent, free introduction to R and the tidyverse

Advanced R - A step up in complexity from r4ds (and best tackled after r4ds), but another excellent book by Hadley Wickham

ggplot2 Book - Excellent book, again by Wickham, that overviews ggplot2 capabilities.

Biostatistics for Biomedical Research - online course covering lots of statistical (and more broadly, biomedical research) topics from the excellent statistician, Frank Harrell; this is more generally biostatistics-focused, but works through accompanying R code

Harrell’s Regression Modeling Strategies Course (RMSC) - Frank Harrell’s online book/“course” that accompanies the excellent RMS textbook (Harrell 2015)

R pkgs - Very good book on developing R packages

Big Book of R - A frequently-updated collection of (probably) every single online R text there is; probably most of this stuff will not be useful to you, but if you’re looking for something - there’s a good chance you can find it here.

## 16.3 Git/GitHub

## References

Allison. 2010. *Survival Analysis Using SAS: A Practical Guide*. 2. ed. Cary, NC: SAS Press.

Delwiche and Slaughter. 2019. *The Little SAS Book: A Primer*. Sixth edition. Cary, NC: SAS Institute.

Faries and Institute. 2010. *Analysis of Observational Health Care Data Using SAS*. Cary, North Carolina: SAS Publishing.

Harrell. 2015. *Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis*. 2. ed. Springer Series in Statistics. Cham: Springer.