Open source environment for deep analysis of large complex data

The Power of R with Big Data

Apply the thousands of statistical and visualization methods in the R language with simple commands over back ends like Hadoop - without being an expert in distributed computing.

Tessera components

Get Started in Minutes

Tessera is a powerful computational environment for data large and small. From installation on a single workstation to the Amazon cloud, we've made it easy for you to get started.

Quickstart guide

Resources to Learn & Join

Once you are up and running, check out our detailed documentation, join our mailing list, browse the code, and learn how to join the open source team!

Resources

Tessera Components


The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.


Divide and Recombine (D&R)

Tessera is powered by Divide and Recombine. In D&R, we seek meaningful ways to divide the data into subsets, apply statistical methods to each subset independently, and recombine the results of those computations in a statistically valid way. This enables us to use the existing vast library of methods available in R - no need to write scalable versions.
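The divide, apply, and recombine steps can be sketched in a few lines of base R. This is only an illustration of the idea on a single machine, not the datadr interface: the built-in `mtcars` data, the choice of `cyl` as the division variable, and the helper name `slope_of` are all assumptions made for the example.

```r
# Divide: split a built-in data set into subsets by a grouping variable.
# mtcars and the division variable `cyl` are illustrative choices only.
subsets <- split(mtcars, mtcars$cyl)

# Apply: fit the same model to every subset independently.
# slope_of is a hypothetical helper, not part of any Tessera package.
slope_of <- function(d) unname(coef(lm(mpg ~ wt, data = d))[2])
per_subset <- lapply(subsets, slope_of)

# Recombine: bind the per-subset results into one data frame.
slopes <- data.frame(
  cyl   = names(per_subset),
  slope = unlist(per_subset, use.names = FALSE)
)
print(slopes)
```

Because each subset is processed independently, the apply step is the part a distributed back end can run in parallel.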

Read the D&R Paper

datadr

The datadr R package provides a simple interface to D&R operations. The interface is back end agnostic, so as new distributed computing technology comes along, datadr will be able to harness it. The package currently supports in-memory, local disk / multicore, and Hadoop back ends, with experimental support for Apache Spark. Regardless of the back end, coding is done entirely in R and data is represented as R objects.

datadr Tutorial Source on Github

Trelliscope

Trelliscope is a D&R visualization tool based on Trellis Display that enables scalable, flexible, detailed visualization of data. Trellis Display has repeatedly proven itself as an effective approach to visualizing complex data. Trelliscope, backed by datadr, scales Trellis Display, allowing the analyst to break potentially very large data sets into many subsets, apply a visualization method to each subset, and then interactively sample, sort, and filter the panels of the display on various quantities of interest.
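The sort-and-filter idea can be illustrated without Trelliscope itself. The base R sketch below computes a row of summary metrics for each subset and uses them to decide which panels to look at. It is not the Trelliscope API; the `iris` data, the metric names, and the helper `cog_of` are assumptions for the example.

```r
# Divide a built-in data set; each subset would correspond to one panel.
panels <- split(iris, iris$Species)

# One row of summary metrics per subset, playing the role of cognostics.
# cog_of and its metric names are hypothetical, chosen for illustration.
cog_of <- function(d) data.frame(
  n          = nrow(d),
  mean_petal = mean(d$Petal.Length),
  slope      = unname(coef(lm(Petal.Width ~ Petal.Length, data = d))[2])
)
cogs <- do.call(rbind, lapply(panels, cog_of))

# Sort panels by a metric, then filter to the ones worth viewing.
ranked  <- cogs[order(cogs$mean_petal, decreasing = TRUE), ]
to_view <- rownames(ranked)[ranked$mean_petal > 2]
print(to_view)
```

With three subsets this is trivial, but the same pattern is what makes a display with thousands of panels navigable: the analyst ranks and filters on the metrics rather than paging through every plot.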

Trelliscope Tutorial Source on Github

RHIPE

RHIPE is the R and Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs wholly from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R operations directly through RHIPE, although in this case you are programming at a lower level.
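To give a sense of what programming at the MapReduce level involves, here is a single-machine sketch of the map, shuffle, and reduce phases in base R. It does not use RHIPE or Hadoop and does not reflect RHIPE's function names; the word-count task, `records`, and `map_fn` are assumptions for the example.

```r
# Input records; here each value is a short line of text.
records <- list("a b a", "b c", "a")

# Map: emit one (word, 1) pair per word in each record.
map_fn <- function(value) {
  words <- strsplit(value, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1L))
}
emitted <- unlist(lapply(records, map_fn), recursive = FALSE)

# Shuffle: group the emitted values by key.
keys    <- vapply(emitted, function(p) p$key, character(1))
values  <- vapply(emitted, function(p) p$value, integer(1))
grouped <- split(values, keys)

# Reduce: collapse each key's values to a single result.
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

In a D&R setting the map step would typically assign records to subsets or apply a method to a subset, and the reduce step would perform the recombination.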

RHIPE Tutorial Source on Github

QuickStart


Install

Tessera is available as a local stack that can be installed on a single workstation and a full stack to be installed on a cluster. In either case, we provide straightforward options for you to get your Tessera environment up and running.

The local stack simply consists of the following:

  • R
  • datadr R package
  • Trelliscope R package

The full stack has the above components as well as the following on each node of the cluster:

  • Hadoop
  • RHIPE Hadoop connector R package
  • RStudio Server (namenode only)
  • Shiny Server (namenode only)

Hadoop and RHIPE can also be replaced by other distributed computing technologies and connectors, such as Spark and SparkR. Regardless, your code stays virtually the same.

Local Stack on Workstation

After installing R, simply launch R, install the devtools package from CRAN, and then install the datadr and Trelliscope R packages:

install.packages("devtools") # if not installed
library(devtools)
install_github("tesseradata/datadr")
install_github("tesseradata/trelliscope")

Now you are ready to try out the quick start code or begin working through the tutorials, and your environment is suitable for analyzing small to moderate (low gigabyte) data.
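If you want to confirm that the installation worked before moving on, a check along these lines may help. It is a small sketch using only base R; the variable names are arbitrary, and it only reports whether each package can be loaded, not whether it is up to date.

```r
# Report which Tessera packages can be loaded in this R session.
pkgs <- c("datadr", "trelliscope")
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Name anything still missing so it can be installed with install_github().
missing_pkgs <- names(available)[!available]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
}
```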

Full Stack on Vagrant VM

To get a feel for running in a large-scale Tessera environment, we provide a Vagrant setup that, with a few simple commands, provisions a virtual machine on your workstation running the full Tessera stack.

The Vagrant script and instructions are available on Github.

Full Stack on Amazon Web Services

We have provided an easy way to get going with Tessera in a large-scale environment through a simple set of scripts that provision the Tessera environment on Amazon's Elastic MapReduce (EMR). This allows you to spin up virtual clusters on-demand. An Amazon account is required.

This environment comes with RStudio Server running on the master node, so all you need is a web browser to access RStudio, a fantastic R IDE that will be backed by your own Hadoop cluster.

The EMR scripts and instructions are available on Github.

Full Stack on Your Cluster

Setting up and installing all of the Tessera components on your own cluster will require more commitment in terms of hardware, installation, configuration, and administration. We have put together an installation manual that is available here.

Try It

Here is a simple example to get a feel for Tessera usage. Commentary about the example is available in the datadr tutorial here. For more compelling examples of Tessera in action, as well as in-depth tutorials, check out the Resources section and the blog.

# install package with housing data
devtools::install_github("hafen/housingData")
library(housingData)
library(datadr); library(trelliscope)
library(lattice) # provides xyplot(), used below

# look at housing data
head(housing)

# divide by county and state
byCounty <- divide(housing, 
  by = c("county", "state"), update = TRUE)

# look at summaries
summary(byCounty)

# look at overall distribution of median list price
priceQ <- drQuantile(byCounty, 
  var = "medListPriceSqft")
xyplot(q ~ fval, data = priceQ, 
  scales = list(y = list(log = 10)))

# slope of fitted line of list price for each county
lmCoef <- function(x)
  coef(lm(medListPriceSqft ~ time, data = x))[2]
# apply lmCoef to each subset
byCountySlope <- addTransform(byCounty, lmCoef)

# look at a subset of transformed data
byCountySlope[[1]]

# recombine all slopes into a single data frame
countySlopes <- recombine(byCountySlope, combRbind)
plot(sort(countySlopes$val))

# make a time series trelliscope display
vdbConn("housingjunk/vdb", autoYes = TRUE)

# make and test panel function
timePanel <- function(x)
  xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
    data = x, auto.key = TRUE, ylab = "$ / Sq. Ft.")
timePanel(byCounty[[1]][[2]])

# make and test cognostics function
priceCog <- function(x) { list(
  slope = cog(lmCoef(x), desc = "list price slope"),
  meanList = cogMean(x$medListPriceSqft),
  listRange = cogRange(x$medListPriceSqft),
  nObs = cog(sum(!is.na(x$medListPriceSqft)), 
    desc = "number of non-NA list prices")
)}
priceCog(byCounty[[1]][[2]])

# add display panel and cog function to vdb
makeDisplay(byCounty,
  name = "list_sold_vs_time",
  desc = "List and sold price over time",
  panelFn = timePanel, cogFn = priceCog,
  width = 400, height = 400,
  lims = list(x = "same"))

# view the display
view()

You can view this and some related Trelliscope displays here.

Resources


Tutorials

The best way to get started digging deeper into Tessera is to follow the tutorials for our software components in the following order:

  • datadr
    Learn how to program D&R with datadr
  • Trelliscope
    Learn how to create detailed interactive D&R visualizations
  • RHIPE
    Learn the details of the datadr Hadoop back end

In addition, we have developed other tutorials and analysis narratives that further illustrate the use of the Tessera tools in more realistic data analysis situations. These are available here and on our blog.

Publications

The following publications provide more detail about research relevant to Tessera, as well as illustrate the principles of D&R in various applications:

Get Help / Stay Connected

Get Involved

About Us


Our team consists of faculty, students, and technical staff at the Purdue University Department of Statistics, statisticians and computer scientists at Pacific Northwest National Laboratory and Mozilla Corporation, and independent contributors.

Tessera team members collectively cover all of the intellectual areas of data science, from cluster hardware design for big data to theoretical statistics. Most important is our team's experience in deep analyses of many big datasets. These experiences motivated our development of the Tessera tools in the first place, and they will drive our continued development in the future.
