MLlib is Apache Spark’s machine learning library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Statsmodels is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests. Statsmodels is built on top of the numerical libraries NumPy and SciPy, integrates with Pandas for data handling and uses Patsy for an R-like formula interface.
Statsmodels is part of the scientific Python stack that is oriented towards data analysis, data science, and statistics. Graphical functions are based on the Matplotlib library. Statsmodels provides the statistical backend for other Python libraries. Statsmodels is free software released under the Modified BSD (3-clause) license.
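As a rough illustration of the Pandas integration and the R-like formula interface described above, the sketch below fits an ordinary least squares model from a formula string. The synthetic data, the column names `x` and `y`, and the true coefficients are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): y depends linearly on x plus small noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)})

# The formula string names DataFrame columns, as in R; an intercept is
# included by default.
model = smf.ols("y ~ x", data=df)
result = model.fit()

# Estimated coefficients, indexed by term name ("Intercept" and "x").
params = result.params
```

Calling `result.summary()` prints the full regression table, including standard errors and test statistics.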
Linear regression models:
Mixed Linear Model with mixed effects and variance components
GLM: Generalized linear models with support for all of the one-parameter exponential family distributions
Bayesian Mixed GLM for Binomial and Poisson
GEE: Generalized Estimating Equations for one-way clustered or longitudinal data
Nonparametric statistics: Univariate and multivariate kernel density estimators
Datasets: Datasets used for examples and in testing
Statistics: a wide range of statistical tests
Imputation with MICE, regression on order statistic and Gaussian imputation
Tools for reading Stata .dta files, although pandas provides a more recent reader