[TOC]

- TensorFlow supports "half precision" floats -- float16.
- Float32 is called "full precision"/"single precision".
- Float64: "double precision".
- Shuffling = re-sharding = ...

# `8 - Algos.pptx` @ 2/28: **Three Common "Design Patterns" in Big Data Analysis**

## **Caching / memo(r)ization**: process a lot of data that repeats / 'partly overlaps'

- Caches
  - In PySpark:
    - PySpark's `.cache()`: useful even just for caching a file-loading step.
    - Spark offers 5 storage levels:
      - `MEMORY_ONLY`
      - `MEMORY_ONLY_SER`
      - `MEMORY_AND_DISK`
      - `MEMORY_AND_DISK_SER`
      - `DISK_ONLY`

      > `cache()` will use `MEMORY_ONLY`. If you want to use something else, use `persist(StorageLevel.<*type*>)`.
      >
      > By default `persist()` will store the data in the JVM heap as unserialized objects. ([source](https://stackoverflow.com/a/43231985/1147061))
  - It's a trade-off between:
    - IO-efficient and reusable
    - Redundant but parallel

    Sometimes you may prefer parallelism over IO-efficiency!
  - The whole **cache** can be considered a **defaultdict** in Python, defaulting to computing and storing the execution plan.
    - But it's more than naive key lookups -- consider `A join B` and `B join A`: **the DFs are in a different order but should yield identical results.** There should be a "key interpreter/canonicalizer" that is aware of this (perhaps via a "sort tuple" procedure, as seen on page 11).
  - There's a **limit** (e.g. the capacity of your memory) to caches, so:
    - Manually, you should:
      - only cache things you are going to **reuse**
      - remember to `unpersist` when you are done with it
      - be aware of constantly updating data sources -- cached DFs may **expire** and should then be dropped.
    - If the manually selected cached data still exceeds memory capacity, the computing platform may intervene and prioritize cached items:
      - either by dropping Least Frequently Used (LFU) items
      - or by dropping Least Recently Used (LRU) items. _(code demonstrated on page 13, using another dictionary to store last access time)_
  - Difference between caching and memoizing: ([source](https://stackoverflow.com/a/6469677/1147061))
    - "Memoization" is "caching the result of a **deterministic** function" that can be reproduced at any time given the same function and inputs.
    - "Caching" includes basically **any output-buffering** strategy, whether or not the source value is reproducible at a given time. In fact, caching is also used to refer to *input* buffering strategies, such as the write-cache on a disk or in memory. So it is a **much more general** term.
  - Memoization
    - a key concept in **dynamic programming**

## Process many different tasks in 'parallel' (3/12/2018)

- **Task scheduling** (multitasking / task switching)
  - key concepts: ([source](http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/))
    - **Sync:** blocking operations.
    - **Async:** non-blocking operations.
    - **Concurrency:** making progress together.
    - **Parallelism:** making progress in parallel. *Parallelism $\subset$ Concurrency.*
  - Concurrency in Python
    - **Manual** approach: the `data = operator(data)` approach
      - using `queue`
    - **Automatic** approach (great reading: [source](http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/))
      - `multiprocessing` spawns multiple processes
      - `threading` uses threads
      - `concurrent.futures` is a simpler interface to the two above (see the sketch at the end of this section)
      - `asyncio`
- **Random exploration via genetic algorithms**
  - randomness + parallelism
  - consider it a non-exhaustive search -- by using random sampling.
  - this also makes it a good candidate for approximate algorithms.
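A minimal sketch (not from the slides) of the `concurrent.futures` route listed above; `expensive_transform` and the toy input are hypothetical placeholders.

```python
# A minimal sketch of the "automatic" concurrency route above, using concurrent.futures.
from concurrent.futures import ProcessPoolExecutor

def expensive_transform(record):
    # placeholder for a CPU-heavy, side-effect-free operation
    return record ** 2

if __name__ == "__main__":
    records = list(range(10))
    # ProcessPoolExecutor parallelizes across processes; swap in ThreadPoolExecutor
    # for IO-bound work.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(expensive_transform, records))
    print(results)
```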
## **Search and constraint solvers**: find an item, a parameter, etc. that maximizes an objective

- Planning consists of:
  1. Defining start and goal states
  2. Defining a sequence of actions and constraints about how they can be used
  3. Defining a search strategy
  4. Finding pruning methods
- pruning
  - forward chaining
  - **backward chaining** -- allows for more advanced pruning, such as "branch-and-bound pruning"

# @3/14/2018 - **Unsupervised data analysis**

## Principal Component Analysis (PCA) -- Group similar features together (roughly)

- PCA can be viewed as: a rotation to a new coordinate system (a.k.a. "a projection to **a new space**") that **maximizes the variance** in the new coordinates.
- PCA is scale-variant -- rescaling the data w.r.t. any feature will change the PCA result. Also, the center ("mean") of the data should really be at the origin.
  - Thus, **standardization** is important.
  - but be aware of log-scaled raw features!
- Realized by computing the covariance matrix.
  - Its eigenvectors are our principal components (mapping from the original space to the PCA-ed space).
    - To compute them: use singular value decomposition (SVD) with fast algorithms like "randomized SVD".
  - Its eigenvalues are the variance explained. Use the eigenvectors in descending order of this!
    - Excess principal components merely capture noise in the data.
- compare with: t-distributed Stochastic Neighbor Embedding (t-SNE)
  - non-deterministic (each run of t-SNE, even on the same set of data, can give different results)
  - local-focused
  - coordinates mean nothing. Information is all in proximity.
  - which leads to the topic of clustering...
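Before moving on, a minimal sketch (assuming scikit-learn; the correlated toy data is hypothetical) of the PCA recipe above: standardize, fit with a randomized SVD solver, and read off the variance explained per component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated features

X_std = StandardScaler().fit_transform(X)     # PCA is scale-variant, so standardize first
pca = PCA(n_components=3, svd_solver="randomized").fit(X_std)
print(pca.explained_variance_ratio_)          # components come sorted by variance explained
X_new = pca.transform(X_std)                  # projection into the new space
```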
## Clustering -- Group similar items together

- K-Means
  - Minimizes the within-cluster Sum of Squared Errors (SSE) (a.k.a. the distortion function).
    - Non-convex.
  - K-means++: initializes centroids as far away from each other as possible.
  - K-means in SQL (see `.sql` file)
  - `pyspark.mllib.clustering` has k-means packaged: `KMeans`, `KMeansModel`.
  - Elbow method for choosing the right number of clusters
  - An alternative to the centroid is the "medoid": while the centroid is the numerical average for continuous values, the medoid is for categorical features -- choosing the most representative/frequent point.
- Hierarchical clustering
  - major difference from k-means: the # of clusters is determined iteratively, rather than specified upfront.
  - two approaches:
    - Agglomerative (a.k.a. AHC or HAC): start with single-item clusters and keep merging the closest items. <- we will focus on this
    - Divisive: start with one cluster containing all data points; keep dividing until all clusters are single points.
  - distance between clusters
    - **single** linkage:
      - compute the distance between the most **similar** members for each pair of clusters.
      - AHC: merge the clusters with the **smallest** such distance.
    - **Complete** linkage:
      - compute the distance between the most **different** members for each pair of clusters.
      - AHC: merge the clusters with the **smallest** such distance.
  - implementation in Spark
    - `spark.mllib` contains divisive, not agglomerative
    - Uber has their own implementation of AHC
  - pros and cons
    - Pros:
      - plots dendrograms -- helps taxonomy
      - can stop at any number of clusters at will
    - Cons:
      - does not scale well
      - (k-means too) assumes clusters have a **spherical shape**
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  - each point is assigned one label: (great read: [source](https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf))
    - A **core point** has more than a specified number of **points** (MinPts) within Eps -- these are **points** in the interior of a **cluster**.
    - A **border point** has fewer than MinPts within Eps, but is in the neighborhood of a core **point**.
    - A **noise point** is any **point** that is neither a core **point** nor a **border point**.
  - **Handles non-spherically-clustered datasets**!
  - Noise tolerant.

### How do they compare

- time complexity: (n: # of points; d: # of dimensions; k: # of clusters)
  - k-means: `d * n * k * # of iterations * # of restarts`
  - Agglomerative: `> d * n^2`
  - DBSCAN: depends on the region size; suffers from the curse of dimensionality

### Other Techniques

- Locality-sensitive hashing: ...
- MinHash: hashing for Jaccard distance

# Supervised Learning

## @03/26/2018 - decision trees

- types of supervised learning
  - **Classification**: y is categorical
  - **Regression**: y is continuous
- Model may be:
  - Parametric: a known functional form -- we are here to estimate its parameter values
    - linear and logistic regression
  - Non-parametric: no functional form assumed
    - decision trees and random forests
    - boosted models
  - Semi-parametric: so many parameters as to be effectively non-parametric
    - e.g. neural networks
- Decision trees
  - can be used for feature selection
  - split first on the option that provides the highest Information Gain
    - The more uniformly distributed the data is, the higher its entropy (and the higher its Gini impurity).
  - very susceptible to overfitting
    - if features are highly correlated, use PCA first
    - balance the amount of training data for each label
    - limit maximum depth
    - keep the minimum number of samples for a split from being too low
    - prune after training
  - Ensemble version: random forest
    - bootstrap -- sample (with replacement) for each member (individual decision trees).
    - majority vote (or average) over members' outputs.
    - Variation: extremely random forests
      - Not only is the training data randomly sampled,
        > instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the **best of these randomly-generated thresholds** is picked as the splitting rule. ([source](http://scikit-learn.org/stable/modules/ensemble.html))
    - A d-tree can tell repetitive features apart and mark them "less important".
  - summary
    - decision trees
      - fast to train
      - easily interpretable
    - random forests
      - highly accurate
      - do not require hyperparameter search
    - both
      - scale **invariant** -- these models are non-parametric!
      - handle both numerical and categorical data

## @03/28/2018 - regression, boosting and SVMs

### Regression

- **Parametric or not?** Linear regression models, and therefore logistic regression too, are examples of parametric models.
- **A Word On Regularization**: Whether it is L2 regularization ("ridge"), L1 regularization ("lasso") or PCR, each of them decides, for every feature, how much it should be suppressed before being fed into a linear regression model. (See the sketch below.)
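A minimal sketch (assuming scikit-learn; the synthetic data and alpha values are hypothetical) of how the L2 ("ridge") and L1 ("lasso") penalties suppress feature weights; lasso can drive some weights exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0] + [0.0] * 7)    # only 3 features actually matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    # ridge shrinks every weight a little; lasso zeroes out the unimportant ones
    print(name, np.round(model.coef_, 2))
```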
#### The Linear Regression Family

- Linear Regression
  - minimizes root mean squared error (RMSE)
    - in practice, the `sqrt()` is omitted -- minimizes "MSE".
  - bright side: it has a closed-form solution, but there are potential problems:
    - space complexity: $X^T\cdot X$ is $d\times d$ -- can be big; may not fit in memory.
    - time complexity: $n\cdot d^2$ (d is the # of features) -- can be big.
    - If $d>n$, the inversion step hits a singular matrix and fails.
  - scale invariant -- needs no normalization
- **Ridge regression** := linear regression + $\lambda\, w^T\cdot w$
  - minimizes MSE + L2 penalty
  - a.k.a. "ridge", "L2 regularization", etc.
  - idea: shrinks all the weights a little -- just shrinks, without making any feature's weight go to zero.
  - NO LONGER scale invariant
- **Elastic net** := linear regression + $\lambda_1\, w^T\cdot w$ + $\lambda_2\, \|w\|_1$ (L1 penalty)
  - the L1 penalty is also called the "lasso".
  - idea: L1 can drive the least-important features' weights to zero.
  - now you have two hyperparameters for regularization: $\lambda_1$ and $\lambda_2$

#### Principal Component Regression

1. Do PCA on X.
2. Project X onto the PCA loadings: $T=X\cdot W$ (n×k = (n×p)(p×k))
3. Use T as training data instead of X. Usually with linear regression, too.
4. To predict with this PCR model, project test examples onto W, then feed them into the "linear regression" sub-model.

- Often used in computational linguistics.
- Different from the ones introduced above: PCR can be called a "semi-supervised learning algorithm"
  - the PCA part is unsupervised
  - the actual regression part is supervised.
- just like PCA -- not scale-invariant

#### Logistic Regression := linear regression + squashing through the logistic function + binarizing results to 0 & 1

- for binary labels (thus for classification, rather than regression)
- Since the output will be binarized to 0 & 1, the value before this binarization is treated as the probability -- probability estimate: $\sigma(w^T\cdot x)$

### Boosting

- train a series of weak ("dumb") classifiers, each one focusing more on the examples mis-classified by the previous one.
- is an ensemble method

### SVM

- also for classification.
- maximizes the margin...
- "Soft margin": we can trade off between margin width and violations of the boundary.
- Usually uses "hinge loss" -- examples that are correctly classified (with enough margin) contribute no loss.
- kernels
  - feature engineering (not covered)
  - linear
  - polynomial (order can be configured)
  - radial basis (a.k.a. Gaussian)
  - when using a kernel, not scale-invariant.

## @04/02/2018: Tuning and evaluating classifiers

- Complexity
  - we want to find the optimal complexity that balances training error and test error -- avoid underfitting and overfitting.
  - visualize with learning curves.
- k-fold cross validation
  - Holdout sets: part of the training data is held out and used like a test set.
- some curves:
  - validation curve: x-axis is the value of a hyperparameter; y-axis is the score (see the sketch at the end of this section). ![](http://scikit-learn.org/stable/_images/sphx_glr_plot_validation_curve_0011.png)
    - take the peak.
  - learning curve: x-axis is the amount (percentage) of training data used. ![](http://scikit-learn.org/stable/_images/sphx_glr_plot_learning_curve_0021.png)
    - Validation/test score should also be high. Don't let it drop -- that would be overfitting.
  - ROC curves
- confusion matrix and all the related performance metrics. (probably omittable)
  - accuracy is often the wrong measure.
  - some are: error, precision, recall, F1 score.
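A minimal sketch (assuming scikit-learn; the synthetic task, the SVM, and the C range are hypothetical choices) of the validation curve described above: cross-validated score vs. one hyperparameter, pick the peak.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
param_range = np.logspace(-3, 2, 6)               # candidate values of the SVM's C

train_scores, valid_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="C", param_range=param_range, cv=5)

# Average over the 5 folds; the best C is where the validation score peaks.
mean_valid = valid_scores.mean(axis=1)
print(dict(zip(param_range, np.round(mean_valid, 3))))
print("best C:", param_range[mean_valid.argmax()])
```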
## @4/4/2018 - Artificial Neural Networks

- gradient descent
  - analytical or numerical derivative
  - variations in terms of training-data granularity:
    - **(vanilla) gradient descent** requires all training data to be loaded to compute the gradient.
    - **stochastic gradient descent** calculates the gradient on a per-example basis, allowing for online learning.
    - **Mini-batch**: updates the model every $k$ observations -- a hybrid of the two above.
  - Can get caught in **local minima** -- an alternative, **simulated annealing**, uses randomness.
    - according to a "cooling schedule", it is initially more likely to jump randomly.
    - still no guarantee, but already less susceptible to local minima.
- logistic regression is actually a basic "artificial neuron"
  - activation function
    - uses the sigmoid function as its "model" -- in this sense, it's similar to logistic regression.
    - Alternative to the sigmoid function if you want a **hard classifier**: the Heaviside function.
  - A neuron can take multiple inputs -- just like the features fed into logistic regression.
    - with an extra "always-on" (i.e. always emitting $1$) input, called the "bias". Its effect on the neuron's decision is controlled by its associated weight rather than by its value itself.
      - regularization is NOT applied to it.
  - $\eta$ is the learning rate.
- A perceptron is a single layer of many such neurons.
  - consider this as multiple regressions running at once (each neuron being one).
  - still learns a linear boundary -- convergence is guaranteed only if the training data is linearly separable.
  - to deal with non-linearly-separable data,
    - SVMs use kernels
    - neural nets use extra layers. See below.
      - (this doesn't mean we technically cannot use kernels with perceptrons)
- deeper networks are just models with multiple such layers -- feed-forward networks
  - can deal with non-linearly-separable data.
  - **structure:** input layer -> hidden layer(s) -> output layer
    - each layer can have multiple neurons.
  - activation function: usually ReLU ("rectified linear") here.

### @4/9/2018 - Convolutional Neural Networks

Great reading: [source](http://cs231n.github.io/convolutional-networks/).

![](http://cs231n.github.io/assets/nn1/neural_net2.jpeg)

- for images
- ==CNN uses local receptive fields==
- types of layers
  - **Convolution** connects the *receptive field* to a neuron in the next layer
    - often with overlap; the "stride" is the step length.
    - will hit the border of the image -- needs zero-padding (how much? also a hyperparameter)
    - kernels in a CNN are called "filters".
    - > The spatial extent of this connectivity is a hyperparameter called the *receptive field* of the neuron (equivalently this is the filter size).
  - **Pooling** does aggregation (often: max)
    - pooling ("downsampling"): most often max-pooling, etc.
  - **Detection** uses sigmoid or ReLU
    - usually uses ReLU as the activation function -- cheaper to compute the derivative
- ==back propagation==
- ==ways of regularization==:
  - L2
  - Max norm (L∞)
  - Early stopping
  - ==Dropout==: randomly dropping units (and their connections) during training.
- techniques for **gradient descent**
  - Gradient descent
    - stochastic
    - gradient clipping
  - ==Minibatch==
  - Momentum
  - Learning rate adaptation
    - Adagrad: make the learning rate depend on previous changes in *each* weight
- **Semi-supervised learning**

# @04/11/2018 - Time Series

- What makes a TS different from, say, a regular regression problem?
  1. It is **time dependent**.
     So the basic assumption of a linear regression model -- that the observations are independent -- doesn't hold in this case.
  2. Along with an increasing or decreasing trend, most TS have some form of **seasonality**, i.e. variations specific to a particular time frame. For example, if you look at the sales of woolen jackets over time, you will invariably find higher sales in the winter season.
- some techniques
  - use a moving average to smooth out short-term fluctuations and highlight longer-term trends or cycles.
  - use `log()` if the data appears to be exponential -- this gives something closer to linear -- easier to deal with.
- Time series are **stationary** if they do not have trend or seasonal effects.
  - Summary statistics (such as the mean or variance) calculated over a stationary time series are consistent over time.
  - Stationarity is an assumption underlying many statistical procedures used in time series analysis, so non-stationary data is often transformed to become stationary.
- **How to make a time series stationary**:
  - **Estimate and eliminate trend**
    - **Transformation** -- e.g. take the log, which penalizes higher values more than smaller ones.
    - **Aggregation** -- take the average over a time period, like monthly/weekly averages
    - **Smoothing** -- take rolling (moving) averages.
      - a "weighted moving average" gives more recent values a higher weight than older values
      - in an "**exponentially weighted moving average**", weights are assigned to all previous values with a decay factor.
        - No data is left behind -- all of it is taken into consideration.
    - **Polynomial fitting** -- fit a regression model
  - **Remove trend and seasonality**
    - **Differencing** -- taking the difference with a particular time lag
    - **Decomposition** -- modeling both trend and seasonality and removing them from the model.
- Converting a time series from one frequency to another
  - Downsampling -- higher to lower frequency
  - Upsampling -- lower to higher frequency
- **Statistical test: Augmented Dickey-Fuller (ADF)** (see the sketch at the end of this section)
  - Tests the **null hypothesis** that the time series is non-stationary.
  - The more negative the test statistic, the stronger the rejection of that hypothesis.
- how to make predictions with time series data
  - Models
    - **Auto-Regressive model AR(p)**
      - $x(t) = c_0 + c_1 x(t-1) + c_2 x(t-2) + \dots + c_p x(t-p) + \epsilon_t$
    - **Moving Average model MA(q)**
      - $x(t) = c_0 + c_1 e(t-1) + c_2 e(t-2) + \dots + c_q e(t-q) + \epsilon_t$
      - $e(t) = \epsilon_t$ = error in prediction at time $t$
    - **Auto-Regressive Integrated Moving Average (ARIMA)** forecasting
      - a linear equation based on three parameters (p, d, q)
        - **p auto-regressive (AR) terms** (a.k.a. "lags of the dependent variable"). For instance, if p is 5, the predictors for x(t) will be x(t-1)…x(t-5).
        - **q moving average (MA) terms** (a.k.a. "lagged forecast errors in the prediction equation"). For instance, if q is 5, the predictors for x(t) will be e(t-1)…e(t-5), where e(i) is the difference between the moving average at the i-th time step and the actual value.
        - **d non-seasonal differences**
      - **How to determine p and q?**
        - Plot the autocorrelation function and the partial autocorrelation function and see where they cross an upper confidence interval.
        - For this data, p = q = 2.
  - technique: **Differencing**
    - $y(t) = x(t) - x(t-1)$, then fit the model on $y(t)$
    - makes the process more stationary
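A minimal sketch (assuming pandas and statsmodels are installed; the random-walk series is a hypothetical example) of the ADF test and first-order differencing described above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
# hypothetical random walk -- a classic non-stationary series
ts = pd.Series(rng.normal(size=200).cumsum())

stat, pvalue = adfuller(ts)[:2]
print("raw series:   ADF stat = %.2f, p = %.3f" % (stat, pvalue))   # fails to reject H0

diffed = ts.diff().dropna()          # y(t) = x(t) - x(t-1)
stat, pvalue = adfuller(diffed)[:2]
print("differenced:  ADF stat = %.2f, p = %.3f" % (stat, pvalue))   # much more negative
```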
# @04/16/2018 - TensorFlow

- TensorFlow is based on a **Computation Graph** executed in parallel.
- All data are represented as tensors -- analogous to columns.
- Two sets of APIs:
  - one resembles sklearn
    - to initialize a classifier, you have to specify feature columns -- different from sklearn.
  - one is lower level
- TensorFlow column datatypes (`tf.contrib.layers.[...]`)
  - for real-valued features: `real_valued_column`
  - for categorical features:
    - If you know all possible values: use `sparse_column_with_keys`.
    - If you cannot iterate over all categories (or simply want to allocate ordinal values to categorical values on the fly): use `[...]_hash_bucket`.
      - choose the bucket count wisely -- balance between hashing collisions and memory consumption
  - to use a numerical feature as categorical: `bucketized_column`
  - for feature combinations: `CrossedColumns`

## Distributed TensorFlow

- Master nodes know the whole Computation Graph.
- Worker nodes know only the operations they are assigned.
  - worker 0 may be the "parameter server" and hold the mutable data (weights, biases, etc.)
  - worker 1 may hold the training data and compute some operations.
- Data can be stored on a Spark cluster (i.e. in HDFS) and **streamed** to the TensorFlow cluster. Yahoo did this: `TensorFlowOnSpark`.

## Recurrent Neural Networks (RNN) - Handles Time Series

- two approaches to time series
  - use a standard neural net or CNN
    - such as a nonlinear AR(k) model
  - use an RNN
    - generalizes HMMs or Linear Dynamical Systems
    - the only choice if inputs have varying lengths
    - See a **standard hidden Markov model**: each "stage" takes the previous result as part of its input.
- variations
  - Long Short-Term Memory (LSTM)
  - gated RNNs
  - stacked RNNs
- can be used...
  - ... to predict the next observation given past observations (like an ARIMA model)
  - ... to map one sequence to another sequence ("encoder-decoder" structure)
    - translating sentences from one language to another
    - chatbots
    - auto-captioning images

# @04/18/2018 - Online Learning

- Online learning methods
  - Least mean squares (LMS)
    - online regression -- L2 error
  - Perceptron
    - Idea
      - if we got it right: no change;
      - if we got it wrong: $\vec w_{i+1} = \vec w_i + y_i \vec x_i$.
        - This makes $\vec w$ look more like $\vec x_i$, so the hyperplane defined by $\vec w$ becomes more orthogonal to this example $\vec{x}_i$.
    - In practice, we use **averaged perceptrons** -- a cheap approximation to **voted perceptrons**. (See the sketch at the end of this lecture's notes.)
      - Return as the final model the average of all intermediate models
        - sounds like ensemble learning, but it actually keeps just one single, averaged model at any time.
      - Better than voted perceptrons: run time nearly as fast as a single perceptron.
    - can use **kernels**.
    - variation: Online SVM -- hinge loss
  - Online K-means
- different from neural nets: we look at each example once and throw it away.

## Analyzing Real-Time Data Streams

- **Data Streams**
  - may not be periodic
  - updates most often have a delay from the actual event
  - the reported timestamp may...
    - not be precise (time-sync server problems, etc.)
    - have timezone issues (the timezone is sometimes needed, sometimes to be removed)
- consider operations on data streams to be "continuous queries".
  - outputs can also be considered data streams
  - types
    - cumulative operations -- performed on all data streamed so far
    - **Rolling or windowed** operations (e.g. a "last-two-minutes window")
      - tumbling windows: no overlap
      - sliding windows: move a fixed length every step (like a queue)
      - sliding + partitioned windows: shards the data by key into separate sliding windows
- architecture
  - lambda architecture
  - Apache Spark Streaming
  - ...
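A minimal sketch (numpy only; the toy data and epoch count are hypothetical) of the averaged-perceptron update described earlier: on a mistake, $w \leftarrow w + y\,x$; the returned model is the average of all intermediate weight vectors.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    """X: (n, d) features; y: labels in {-1, +1}. Returns the averaged weight vector."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # mistake (or on the boundary): update
                w = w + y_i * x_i
            w_sum += w                    # accumulate every intermediate model
    return w_sum / (epochs * len(X))      # cheap stand-in for the voted perceptron

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # linearly separable toy labels
w_avg = averaged_perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w_avg) == y))
```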
# @04/23/2018 - Stream Processing Systems (a wrap-up of stream processing)

- Apache Spark
  - components:
    - data streams exposed as ever-changing **DataFrames**
    - we define **windows** over streams
      - "joins/merges" make sense on windows, not so much on the streams themselves.
  - Based on "micro-batching" -- periodic invocations of the Spark engine on batches of tuples
  - To process:
    - a `StreamingContext` gets started
    - `awaitTermination()` is run
    - can process whatever is in the DataStream as a DataFrame
    - can run `countByWindow()` etc. to get time-based or tuple-based windows
- Apache Storm (and Heron): distributes streams among distributed modules
  - components
    - **spout**: interfaces with the world, **produces** streams
      - emits lists of tuples
    - **bolt**: receives streams, optionally producing streams + reading/updating state.
  - structure: spouts and bolts are usually pipelined. They can also be **stacked** (for distributed computing).
  - promises robust execution (even compared to Spark)
  - can be used to mimic a MapReduce structure
  - `streamparse` is to Storm what `pyspark` is to Spark.

# Visualization

- Get a **holistic** sense of the **data** as we load + analyze it
  - histograms, scatter plots, correlations, time series
- Understand our **algorithms' performance**
  - learning curve, validation curve, ROC graph
- Present information as part of a **report** or "**dashboard**"
  - figures illustrating performance (in the economic sense, etc.)
- Potential kinds of plots
  - Exploratory graphics -- for the data scientist. You want to create them rapidly and iterate.
  - Communication graphics -- for communicating your findings. Again, iteration is important.
- "Grammar of Graphics"
  - basis of R's ggplot2
    - Python port: `ggplot`
  - divides plots into:
    - layers
      - data
      - aesthetic mappings
      - geometric objects
      - **statistical transformations**
      - position adjustments
    - scales
    - coordinate system
    - facets (groups)
- more advanced visualization in Python
  - `seaborn`: builds upon `matplotlib` with a focus on statistical plots
  - `lightning` + `d3.js`: "dashboards"
    - requires a server -- Jupyter is also a web server

# @04/25/2018 - Data Science Ethics

- Why do people do the right thing?
  - Morality (ethics)
- Ethical principles
  - **Autonomy**: the right to control your data, possibly via *surrogates*
  - **Informed Consent**: you should explicitly approve use of your data based on understanding
    - required in human-subjects research
      - must understand what is being done
      - must voluntarily consent to the experiment
      - must have the right to withdraw consent at any time
    - but no one requires this in the "ordinary conduct of business"
      - consequently we are constantly subject to tests such as A/B tests.
  - **Beneficence**: people using your data should do it for your benefit
    - or at least **Non-maleficence**: do no harm
  - **Differential privacy** aims to maximize the accuracy of queries on statistical databases while minimizing the chances of identifying individual records. (See the sketch at the end of these notes.)
- we want results that are ...
  - reproducible
  - fair
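Not from the slides: a minimal sketch of the classic Laplace mechanism, one standard way to realize the differential-privacy goal above. The records, the query, and the privacy budget `epsilon` are all hypothetical.

```python
import numpy as np

def private_count(values, predicate, epsilon=0.5):
    """Return a differentially private count of items satisfying `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0   # adding or removing one record changes a count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 62, 57, 33]   # hypothetical records
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```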