[TOC]

- TensorFlow supports "half precision" floats -- float16.
- Float32 is called "full precision"/"single precision".
- Float64: "double precision".
- Shuffling = re-sharding = ...

# `8 - Algos.pptx` @ 2/28: **Three Common "Design Patterns" in Big Data Analysis**

## **Caching / memo(r)ization**: process a lot of data that repeats / 'partly overlaps'

- Caches
  - In PySpark:
    - PySpark's `.cache()`: useful even just for caching a file-loading step.
    - Spark offers 5 storage levels:
      - `MEMORY_ONLY`
      - `MEMORY_ONLY_SER`
      - `MEMORY_AND_DISK`
      - `MEMORY_AND_DISK_SER`
      - `DISK_ONLY`

      > `cache()` will use `MEMORY_ONLY`. If you want to use something else, use `persist(StorageLevel.<*type*>)`.
      >
      > By default `persist()` will store the data in the JVM heap as unserialized objects. ([source](https://stackoverflow.com/a/43231985/1147061))
  - It's a trade-off between:
    - IO-efficient and reusable
    - Redundant but parallel

    Sometimes you may prefer parallelism over IO-efficiency!
  - The whole **cache** can be considered a **defaultdict** in Python, defaulting to computing and storing the execution plan.
    - But it's more than naive key lookups -- consider `A join B` and `B join A`: **the DFs are in a different order but should yield identical results.** There should be a "key interpreter/canonicalizer" that is aware of this (perhaps via a "sort tuple" procedure, as seen on page 11).
  - There's a **limit** (e.g. the capacity of your memory) to caches, so:
    - Manually, you should:
      - only cache things you are going to **reuse**
      - remember to `unpersist` when you are done with it
      - be aware of constantly updating data sources -- cached DFs may **expire** and should then be dropped.
    - If the manually selected cached data still exceeds memory capacity, the computing platform may intervene and prioritize cached items:
      - either by dropping Least Frequently Used (LFU) items
      - or by dropping Least Recently Used (LRU) items. _(code demonstrated on page 13, using another dictionary to store last access time)_
  - Difference between caching and memoizing: ([source](https://stackoverflow.com/a/6469677/1147061))
    - "Memoization" is "caching the result of a **deterministic** function" that can be reproduced at any time given the same function and inputs.
    - "Caching" includes basically **any output-buffering** strategy, whether or not the source value is reproducible at a given time. In fact, caching is also used to refer to *input* buffering strategies, such as the write-cache on a disk or in memory. So it is a **much more general** term.
  - Memoization
    - a key concept in **dynamic programming**

## Process many different tasks in 'parallel' (3/12/2018)

- **Task scheduling** (multitasking / task switching)
  - key concepts: ([source](http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/))
    - **Sync:** blocking operations.
    - **Async:** non-blocking operations.
    - **Concurrency:** making progress together.
    - **Parallelism:** making progress in parallel. *Parallelism $\subset$ Concurrency.*
  - Concurrency in Python
    - **Manual** approach: the `data = operator(data)` approach
      - using `queue`
    - **Automatic** approach (great reading: [source](http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/))
      - `multiprocessing` spawns multiple processes
      - `threading` uses threads
      - `concurrent.futures` is a simpler interface to the two above (see the sketch at the end of this section)
      - `asyncio`
- **Random exploration via genetic algorithms**
  - randomness + parallelism
  - consider it a non-exhaustive search -- by using random sampling.
  - this also makes it a good candidate for approximate algorithms.
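A minimal sketch (not from the slides) of the `concurrent.futures` route listed above; `expensive_transform` and the toy input are hypothetical placeholders.

```python
# A minimal sketch of the "automatic" concurrency route above, using concurrent.futures.
from concurrent.futures import ProcessPoolExecutor

def expensive_transform(record):
    # placeholder for a CPU-heavy, side-effect-free operation
    return record ** 2

if __name__ == "__main__":
    records = list(range(10))
    # ProcessPoolExecutor parallelizes across processes; swap in ThreadPoolExecutor
    # for IO-bound work.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(expensive_transform, records))
    print(results)
```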
## **Search and constraint solvers**: find an item, a parameter, etc. that maximizes an objective

- Planning consists of:
  1. Defining start and goal states
  2. Defining a sequence of actions and constraints about how they can be used
  3. Defining a search strategy
  4. Finding pruning methods
- pruning
  - forward chaining
  - **backward chaining** -- allows for more advanced pruning, such as "branch-and-bound pruning"

# @3/14/2018 - **Unsupervised data analysis**

## Principal Component Analysis (PCA) -- Group similar features together (roughly)

- PCA can be viewed as: a rotation to a new coordinate system (a.k.a. "a projection to **a new space**") that **maximizes the variance** in the new coordinates.
- PCA is scale-variant -- rescaling the data w.r.t. any feature will change the PCA result. Also, the center ("mean") of the data should really be at the origin.
  - Thus, **standardization** is important.
  - but be aware of log-scaled raw features!
- Realized by computing the covariance matrix.
  - Its eigenvectors are our principal components (mapping from the original space to the PCA-ed space).
    - To compute them: use singular value decomposition (SVD) with fast algorithms like "randomized SVD".
  - Its eigenvalues are the variance explained. Use the eigenvectors in descending order of this!
    - Excess principal components merely capture noise in the data.
- compare with: t-distributed Stochastic Neighbor Embedding (t-SNE)
  - non-deterministic (each run of t-SNE, even on the same set of data, can give different results)
  - local-focused
  - coordinates mean nothing. Information is all in proximity.
  - which leads to the topic of clustering...
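Before moving on, a minimal sketch (assuming scikit-learn; the correlated toy data is hypothetical) of the PCA recipe above: standardize, fit with a randomized SVD solver, and read off the variance explained per component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated features

X_std = StandardScaler().fit_transform(X)     # PCA is scale-variant, so standardize first
pca = PCA(n_components=3, svd_solver="randomized").fit(X_std)
print(pca.explained_variance_ratio_)          # components come sorted by variance explained
X_new = pca.transform(X_std)                  # projection into the new space
```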
## Clustering -- Group similar items together

- K-Means
  - Minimizes the within-cluster Sum of Squared Errors (SSE) (a.k.a. the distortion function).
    - Non-convex.
  - K-means++: initializes centroids as far away from each other as possible.
  - K-means in SQL (see `.sql` file)
  - `pyspark.mllib.clustering` has k-means packaged: `KMeans`, `KMeansModel`.
  - Elbow method for choosing the right number of clusters
  - An alternative to the centroid is the "medoid": while the centroid is the numerical average for continuous values, the medoid is for categorical features -- choosing the most representative/frequent point.
- Hierarchical clustering
  - major difference from k-means: the # of clusters is determined iteratively, rather than specified upfront.
  - two approaches:
    - Agglomerative (a.k.a. AHC or HAC): start with single-item clusters and keep merging the closest items. <- we will focus on this
    - Divisive: start with one cluster containing all data points; keep dividing until all clusters are single points.
  - distance between clusters
    - **single** linkage:
      - compute the distance between the most **similar** members for each pair of clusters.
      - AHC: merge the clusters with the **smallest** such distance.
    - **Complete** linkage:
      - compute the distance between the most **different** members for each pair of clusters.
      - AHC: merge the clusters with the **smallest** such distance.
  - implementation in Spark
    - `spark.mllib` contains divisive, not agglomerative
    - Uber has their own implementation of AHC
  - pros and cons
    - Pros:
      - plots dendrograms -- helps taxonomy
      - can stop at any number of clusters at will
    - Cons:
      - does not scale well
      - (k-means too) assumes clusters have a **spherical shape**
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  - each point is assigned one label: (great read: [source](https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf))
    - A **core point** has more than a specified number of **points** (MinPts) within Eps -- these are **points** in the interior of a **cluster**.
    - A **border point** has fewer than MinPts within Eps, but is in the neighborhood of a core **point**.
    - A **noise point** is any **point** that is neither a core **point** nor a **border point**.
  - **Handles non-spherically-clustered datasets**!
  - Noise tolerant.

### How do they compare

- time complexity: (n: # of points; d: # of dimensions; k: # of clusters)
  - k-means: `d * n * k * # of iterations * # of restarts`
  - Agglomerative: `> d * n^2`
  - DBSCAN: depends on the region size; suffers from the curse of dimensionality

### Other Techniques

- Locality-sensitive hashing: ...
- MinHash: hashing for Jaccard distance

# Supervised Learning

## @03/26/2018 - decision trees

- types of supervised learning
  - **Classification**: y is categorical
  - **Regression**: y is continuous
- Model may be:
  - Parametric: a known functional form -- we are here to estimate its parameter values
    - linear and logistic regression
  - Non-parametric: no functional form assumed
    - decision trees and random forests
    - boosted models
  - Semi-parametric: so many parameters as to be effectively non-parametric
    - e.g. neural networks
- Decision trees
  - can be used for feature selection
  - split first on the option that provides the highest Information Gain
    - The more uniformly distributed the data is, the higher its entropy (and the higher its Gini impurity).
  - very susceptible to overfitting
    - if features are highly correlated, use PCA first
    - balance the amount of training data for each label
    - limit maximum depth
    - keep the minimum number of samples for a split from being too low
    - prune after training
  - Ensemble version: random forest
    - bootstrap -- sample (with replacement) for each member (individual decision trees).
    - majority vote (or average) over members' outputs.
    - Variation: extremely random forests
      - Not only is the training data randomly sampled,
        > instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the **best of these randomly-generated thresholds** is picked as the splitting rule. ([source](http://scikit-learn.org/stable/modules/ensemble.html))
    - A d-tree can tell repetitive features apart and mark them "less important".
  - summary
    - decision trees
      - fast to train
      - easily interpretable
    - random forests
      - highly accurate
      - do not require hyperparameter search
    - both
      - scale **invariant** -- these models are non-parametric!
      - handle both numerical and categorical data

## @03/28/2018 - regression, boosting and SVMs

### Regression

- **Parametric or not?** Linear regression models, and therefore logistic regression too, are examples of parametric models.
- **A Word On Regularization**: Whether it is L2 regularization ("ridge"), L1 regularization ("lasso") or PCR, each of them decides, for every feature, how much it should be suppressed before being fed into a linear regression model. (See the sketch below.)
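A minimal sketch (assuming scikit-learn; the synthetic data and alpha values are hypothetical) of how the L2 ("ridge") and L1 ("lasso") penalties suppress feature weights; lasso can drive some weights exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0] + [0.0] * 7)    # only 3 features actually matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    # ridge shrinks every weight a little; lasso zeroes out the unimportant ones
    print(name, np.round(model.coef_, 2))
```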
#### The Linear Regression Family

- Linear Regression
  - minimizes root mean squared error (RMSE)
    - in practice, the `sqrt()` is omitted -- minimizes "MSE".
  - bright side: it has a closed-form solution, but there are potential problems:
    - space complexity: $X^T\cdot X$ is $d\times d$ -- can be big; may not fit in memory.
    - time complexity: $n\cdot d^2$ (d is the # of features) -- can be big.
    - If $d>n$, the inversion step hits a singular matrix and fails.
  - scale invariant -- needs no normalization
- **Ridge regression** := linear regression + $\lambda\, w^T\cdot w$
  - minimizes MSE + L2 penalty
  - a.k.a. "ridge", "L2 regularization", etc.
  - idea: shrinks all the weights a little -- just shrinks, without making any feature's weight go to zero.
  - NO LONGER scale invariant
- **Elastic net** := linear regression + $\lambda_1\, w^T\cdot w$ + $\lambda_2\, \|w\|_1$ (L1 penalty)
  - the L1 penalty is also called the "lasso".
  - idea: L1 can drive the least-important features' weights to zero.
  - now you have two hyperparameters for regularization: $\lambda_1$ and $\lambda_2$

#### Principal Component Regression

1. Do PCA on X.
2. Project X onto the PCA loadings: $T=X\cdot W$ (n×k = (n×p)(p×k))
3. Use T as training data instead of X. Usually with linear regression, too.
4. To predict with this PCR model, project test examples onto W, then feed them into the "linear regression" sub-model.

- Often used in computational linguistics.
- Different from the ones introduced above: PCR can be called a "semi-supervised learning algorithm"
  - the PCA part is unsupervised
  - the actual regression part is supervised.
- just like PCA -- not scale-invariant

#### Logistic Regression := linear regression + squashing through the logistic function + binarizing results to 0 & 1

- for binary labels (thus for classification, rather than regression)
- Since the output will be binarized to 0 & 1, the value before this binarization is treated as the probability -- probability estimate: $\sigma(w^T\cdot x)$

### Boosting

- train a series of weak ("dumb") classifiers, each one focusing more on the examples mis-classified by the previous one.
- is an ensemble method

### SVM

- also for classification.
- maximizes the margin...
- "Soft margin": we can trade off between margin width and violations of the boundary.
- Usually uses "hinge loss" -- examples that are correctly classified (with enough margin) contribute no loss.
- kernels
  - feature engineering (not covered)
  - linear
  - polynomial (order can be configured)
  - radial basis (a.k.a. Gaussian)
  - when using a kernel, not scale-invariant.

## @04/02/2018: Tuning and evaluating classifiers

- Complexity
  - we want to find the optimal complexity that balances training error and test error -- avoid underfitting and overfitting.
  - visualize with learning curves.
- k-fold cross validation
  - Holdout sets: part of the training data is held out and used like a test set.
- some curves:
  - validation curve: x-axis is the value of a hyperparameter; y-axis is the score (see the sketch at the end of this section). ![](http://scikit-learn.org/stable/_images/sphx_glr_plot_validation_curve_0011.png)
    - take the peak.
  - learning curve: x-axis is the amount (percentage) of training data used. ![](http://scikit-learn.org/stable/_images/sphx_glr_plot_learning_curve_0021.png)
    - Validation/test score should also be high. Don't let it drop -- that would be overfitting.
  - ROC curves
- confusion matrix and all the related performance metrics. (probably omittable)
  - accuracy is often the wrong measure.
  - some are: error, precision, recall, F1 score.
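A minimal sketch (assuming scikit-learn; the synthetic task, the SVM, and the C range are hypothetical choices) of the validation curve described above: cross-validated score vs. one hyperparameter, pick the peak.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
param_range = np.logspace(-3, 2, 6)               # candidate values of the SVM's C

train_scores, valid_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="C", param_range=param_range, cv=5)

# Average over the 5 folds; the best C is where the validation score peaks.
mean_valid = valid_scores.mean(axis=1)
print(dict(zip(param_range, np.round(mean_valid, 3))))
print("best C:", param_range[mean_valid.argmax()])
```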
## @4/4/2018 - Artificial Neural Networks

- gradient descent
  - analytical or numerical derivative
  - variations in terms of training-data granularity:
    - **(vanilla) gradient descent** requires all training data to be loaded to compute the gradient.
    - **stochastic gradient descent** calculates the gradient on a per-example basis, allowing for online learning.
    - **Mini-batch**: updates the model every $k$ observations -- a hybrid of the two above.
  - Can get caught in **local minima** -- an alternative, **simulated annealing**, uses randomness.
    - according to a "cooling schedule", it is initially more likely to jump randomly.
    - still no guarantee, but already less susceptible to local minima.
- logistic regression is actually a basic "artificial neuron"
  - activation function
    - uses the sigmoid function as its "model" -- in this sense, it's similar to logistic regression.
    - Alternative to the sigmoid function if you want a **hard classifier**: the Heaviside function.
  - A neuron can take multiple inputs -- just like the features fed into logistic regression.
    - with an extra "always-on" (i.e. always emitting $1$) input, called the "bias". Its effect on the neuron's decision is controlled by its associated weight rather than by its value itself.
      - regularization is NOT applied to it.
  - $\eta$ is the learning rate.
- A perceptron is a single layer of many such neurons.
  - consider this as multiple regressions running at once (each neuron being one).
  - still learns a linear boundary -- convergence is guaranteed only if the training data is linearly separable.
  - to deal with non-linearly-separable data,
    - SVMs use kernels
    - neural nets use extra layers. See below.
      - (this doesn't mean we technically cannot use kernels with perceptrons)
- deeper networks are just models with multiple such layers -- feed-forward networks
  - can deal with non-linearly-separable data.
  - **structure:** input layer -> hidden layer(s) -> output layer
    - each layer can have multiple neurons.
  - activation function: usually ReLU ("rectified linear") here.

### @4/9/2018 - Convolutional Neural Networks

Great reading: [source](http://cs231n.github.io/convolutional-networks/).

![](http://cs231n.github.io/assets/nn1/neural_net2.jpeg)

- for images
- ==CNN uses local receptive fields==
- types of layers
  - **Convolution** connects the *receptive field* to a neuron in the next layer
    - often with overlap; the "stride" is the step length.
    - will hit the border of the image -- needs zero-padding (how much? also a hyperparameter)
    - kernels in a CNN are called "filters".
    - > The spatial extent of this connectivity is a hyperparameter called the *receptive field* of the neuron (equivalently this is the filter size).
  - **Pooling** does aggregation (often: max)
    - pooling ("downsampling"): most often max-pooling, etc.
  - **Detection** uses sigmoid or ReLU
    - usually uses ReLU as the activation function -- cheaper to compute the derivative
- ==back propagation==
- ==ways of regularization==:
  - L2
  - Max norm (L∞)
  - Early stopping
  - ==Dropout==: randomly dropping units (and their connections) during training.
- techniques for **gradient descent**
  - Gradient descent
    - stochastic
    - gradient clipping
  - ==Minibatch==
  - Momentum
  - Learning rate adaptation
    - Adagrad: make the learning rate depend on previous changes in *each* weight
- **Semi-supervised learning**

# @04/11/2018 - Time Series

- What makes a TS different from, say, a regular regression problem?
  1. It is **time dependent**.
     So the basic assumption of a linear regression model -- that the observations are independent -- doesn't hold in this case.
  2. Along with an increasing or decreasing trend, most TS have some form of **seasonality**, i.e. variations specific to a particular time frame. For example, if you look at the sales of woolen jackets over time, you will invariably find higher sales in the winter season.
- some techniques
  - use a moving average to smooth out short-term fluctuations and highlight longer-term trends or cycles.
  - use `log()` if the data appears to be exponential -- this gives something closer to linear -- easier to deal with.
- Time series are **stationary** if they do not have trend or seasonal effects.
  - Summary statistics (such as the mean or variance) calculated over a stationary time series are consistent over time.
  - Stationarity is an assumption underlying many statistical procedures used in time series analysis, so non-stationary data is often transformed to become stationary.
- **How to make a time series stationary**:
  - **Estimate and eliminate trend**
    - **Transformation** -- e.g. take the log, which penalizes higher values more than smaller ones.
    - **Aggregation** -- take the average over a time period, like monthly/weekly averages
    - **Smoothing** -- take rolling (moving) averages.
      - a "weighted moving average" gives more recent values a higher weight than older values
      - in an "**exponentially weighted moving average**", weights are assigned to all previous values with a decay factor.
        - No data is left behind -- all of it is taken into consideration.
    - **Polynomial fitting** -- fit a regression model
  - **Remove trend and seasonality**
    - **Differencing** -- taking the difference with a particular time lag
    - **Decomposition** -- modeling both trend and seasonality and removing them from the model.
- Converting a time series from one frequency to another
  - Downsampling -- higher to lower frequency
  - Upsampling -- lower to higher frequency
- **Statistical test: Augmented Dickey-Fuller (ADF)** (see the sketch at the end of this section)
  - Tests the **null hypothesis** that the time series is non-stationary.
  - The more negative the test statistic, the stronger the rejection of that hypothesis.
- how to make predictions with time series data
  - Models
    - **Auto-Regressive model AR(p)**
      - $x(t) = c_0 + c_1 x(t-1) + c_2 x(t-2) + \dots + c_p x(t-p) + \epsilon_t$
    - **Moving Average model MA(q)**
      - $x(t) = c_0 + c_1 e(t-1) + c_2 e(t-2) + \dots + c_q e(t-q) + \epsilon_t$
      - $e(t) = \epsilon_t$ = error in prediction at time $t$
    - **Auto-Regressive Integrated Moving Average (ARIMA)** forecasting
      - a linear equation based on three parameters (p, d, q)
        - **p auto-regressive (AR) terms** (a.k.a. "lags of the dependent variable"). For instance, if p is 5, the predictors for x(t) will be x(t-1)…x(t-5).
        - **q moving average (MA) terms** (a.k.a. "lagged forecast errors in the prediction equation"). For instance, if q is 5, the predictors for x(t) will be e(t-1)…e(t-5), where e(i) is the difference between the moving average at the i-th time step and the actual value.
        - **d non-seasonal differences**
      - **How to determine p and q?**
        - Plot the autocorrelation function and the partial autocorrelation function and see where they cross an upper confidence interval.
        - For this data, p = q = 2.
  - technique: **Differencing**
    - $y(t) = x(t) - x(t-1)$, then fit the model on $y(t)$
    - makes the process more stationary
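A minimal sketch (assuming pandas and statsmodels are installed; the random-walk series is a hypothetical example) of the ADF test and first-order differencing described above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
# hypothetical random walk -- a classic non-stationary series
ts = pd.Series(rng.normal(size=200).cumsum())

stat, pvalue = adfuller(ts)[:2]
print("raw series:   ADF stat = %.2f, p = %.3f" % (stat, pvalue))   # fails to reject H0

diffed = ts.diff().dropna()          # y(t) = x(t) - x(t-1)
stat, pvalue = adfuller(diffed)[:2]
print("differenced:  ADF stat = %.2f, p = %.3f" % (stat, pvalue))   # much more negative
```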
# @04/16/2018 - TensorFlow

- TensorFlow is based on a **Computation Graph** executed in parallel.
- All data are represented as tensors -- analogous to columns.
- Two sets of APIs:
  - one resembles sklearn
    - to initialize a classifier, you have to specify feature columns -- different from sklearn.
  - one is lower level
- TensorFlow column datatypes (`tf.contrib.layers.[...]`)
  - for real-valued features: `real_valued_column`
  - for categorical features:
    - If you know all possible values: use `sparse_column_with_keys`.
    - If you cannot iterate over all categories (or simply want to allocate ordinal values to categorical values on the fly): use `[...]_hash_bucket`.
      - choose the bucket count wisely -- balance between hashing collisions and memory consumption
  - to use a numerical feature as categorical: `bucketized_column`
  - for feature combinations: `CrossedColumns`

## Distributed TensorFlow

- Master nodes know the whole Computation Graph.
- Worker nodes know only the operations they are assigned.
  - worker 0 may be the "parameter server" and hold the mutable data (weights, biases, etc.)
  - worker 1 may hold the training data and compute some operations.
- Data can be stored on a Spark cluster (i.e. in HDFS) and **streamed** to the TensorFlow cluster. Yahoo did this: `TensorFlowOnSpark`.

## Recurrent Neural Networks (RNN) - Handles Time Series

- two approaches to time series
  - use a standard neural net or CNN
    - such as a nonlinear AR(k) model
  - use an RNN
    - generalizes HMMs or Linear Dynamical Systems
    - the only choice if inputs have varying lengths
    - See a **standard hidden Markov model**: each "stage" takes the previous result as part of its input.
- variations
  - Long Short-Term Memory (LSTM)
  - gated RNNs
  - stacked RNNs
- can be used...
  - ... to predict the next observation given past observations (like an ARIMA model)
  - ... to map one sequence to another sequence ("encoder-decoder" structure)
    - translating sentences from one language to another
    - chatbots
    - auto-captioning images

# @04/18/2018 - Online Learning

- Online learning methods
  - Least mean squares (LMS)
    - online regression -- L2 error
  - Perceptron
    - Idea
      - if we got it right: no change;
      - if we got it wrong: $\vec w_{i+1} = \vec w_i + y_i \vec x_i$.
        - This makes $\vec w$ look more like $\vec x_i$, so the hyperplane defined by $\vec w$ becomes more orthogonal to this example $\vec{x}_i$.
    - In practice, we use **averaged perceptrons** -- a cheap approximation to **voted perceptrons**. (See the sketch at the end of this lecture's notes.)
      - Return as the final model the average of all intermediate models
        - sounds like ensemble learning, but it actually keeps just one single, averaged model at any time.
      - Better than voted perceptrons: run time nearly as fast as a single perceptron.
    - can use **kernels**.
    - variation: Online SVM -- hinge loss
  - Online K-means
- different from neural nets: we look at each example once and throw it away.

## Analyzing Real-Time Data Streams

- **Data Streams**
  - may not be periodic
  - updates most often have a delay from the actual event
  - the reported timestamp may...
    - not be precise (time-sync server problems, etc.)
    - have timezone issues (the timezone is sometimes needed, sometimes to be removed)
- consider operations on data streams to be "continuous queries".
  - outputs can also be considered data streams
  - types
    - cumulative operations -- performed on all data streamed so far
    - **Rolling or windowed** operations (e.g. a "last-two-minutes window")
      - tumbling windows: no overlap
      - sliding windows: move a fixed length every step (like a queue)
      - sliding + partitioned windows: shards the data by key into separate sliding windows
- architecture
  - lambda architecture
  - Apache Spark Streaming
  - ...
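A minimal sketch (numpy only; the toy data and epoch count are hypothetical) of the averaged-perceptron update described earlier: on a mistake, $w \leftarrow w + y\,x$; the returned model is the average of all intermediate weight vectors.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    """X: (n, d) features; y: labels in {-1, +1}. Returns the averaged weight vector."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # mistake (or on the boundary): update
                w = w + y_i * x_i
            w_sum += w                    # accumulate every intermediate model
    return w_sum / (epochs * len(X))      # cheap stand-in for the voted perceptron

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # linearly separable toy labels
w_avg = averaged_perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_avg) == y))
```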
# @04/23/2018 - Stream Processing Systems (a wrap-up of stream processing)

- Apache Spark
  - components:
    - data streams exposed as ever-changing **DataFrames**
    - we define **windows** over streams
      - "joins/merges" make sense on windows, not so much on the streams themselves.
  - Based on "micro-batching" -- periodic invocations of the Spark engine on batches of tuples
  - To process:
    - a `StreamingContext` gets started
    - `awaitTermination()` is run
    - can process whatever is in the DataStream as a DataFrame
    - can run `countByWindow()` etc. to get time-based or tuple-based windows
- Apache Storm (and Heron): distributes streams among distributed modules
  - components
    - **spout**: interfaces with the world, **produces** streams
      - emits lists of tuples
    - **bolt**: receives streams, optionally producing streams + reading/updating state.
  - structure: spouts and bolts are usually pipelined. They can also be **stacked** (for distributed computing).
  - promises robust execution (even compared to Spark)
  - can be used to mimic a MapReduce structure
  - `streamparse` is to Storm what `pyspark` is to Spark.

# Visualization

- Get a **holistic** sense of the **data** as we load + analyze it
  - histograms, scatter plots, correlations, time series
- Understand our **algorithms' performance**
  - learning curve, validation curve, ROC graph
- Present information as part of a **report** or "**dashboard**"
  - figures illustrating performance (in the economic sense, etc.)
- Potential kinds of plots
  - Exploratory graphics -- for the data scientist. You want to create them rapidly and iterate.
  - Communication graphics -- for communicating your findings. Again, iteration is important.
- "Grammar of Graphics"
  - basis of R's ggplot2
    - Python port: `ggplot`
  - divides plots into:
    - layers
      - data
      - aesthetic mappings
      - geometric objects
      - **statistical transformations**
      - position adjustments
    - scales
    - coordinate system
    - facets (groups)
- more advanced visualization in Python
  - `seaborn`: builds upon `matplotlib` with a focus on statistical plots
  - `lightning` + `d3.js`: "dashboards"
    - requires a server -- Jupyter is also a web server

# @04/25/2018 - Data Science Ethics

- Why do people do the right thing?
  - Morality (ethics)
- Ethical principles
  - **Autonomy**: the right to control your data, possibly via *surrogates*
  - **Informed Consent**: you should explicitly approve use of your data based on understanding
    - required in human-subjects research
      - must understand what is being done
      - must voluntarily consent to the experiment
      - must have the right to withdraw consent at any time
    - but no one requires this in the "ordinary conduct of business"
      - consequently we are constantly subject to tests such as A/B tests.
  - **Beneficence**: people using your data should do it for your benefit
    - or at least **Non-maleficence**: do no harm
  - **Differential privacy** aims to maximize the accuracy of queries on statistical databases while minimizing the chances of identifying individual records. (See the sketch at the end of these notes.)
- we want results that are ...
  - reproducible
  - fair
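Not from the slides: a minimal sketch of the classic Laplace mechanism, one standard way to realize the differential-privacy goal above. The records, the query, and the privacy budget `epsilon` are all hypothetical.

```python
import numpy as np

def private_count(values, predicate, epsilon=0.5):
    """Return a differentially private count of items satisfying `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0   # adding or removing one record changes a count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 62, 57, 33]   # hypothetical records
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```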