daily 08/14/2016

SISA allows you to do statistical analysis directly on the Internet.

SISA allows you to do statistical analysis directly on the Internet. Click on one of the procedure names below, fill in the form, click the button, and the analysis will take place on the spot. Study the user friendly guides to statistical procedures to see what procedure is appropriate for your problem.


22 spreadsheets to study distributions

The statistical distribution spreadsheets can only be used if you have MS Excel installed on your computer. The spreadsheets also seem to work fine in OpenOffice. Please “click” the links to the spreadsheets below. If your computer is configured correctly, the spreadsheets will be loaded automatically into Excel; otherwise, save the spreadsheets and open them as Excel files.


List of probability distributions – Wikipedia, the free encyclopedia

List of probability distributions


What Statistics Topics are Needed for Excelling at Data Science?

Know Thy Distributions.

Pareto

Gaussian

Exponential

von Mises?

The lognormal!

Fitting. Once you’ve got your distributions down, you should know how to fit them to data in slick ways. Start with maximum likelihood and go from there.
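As a minimal sketch of the "start with maximum likelihood" advice, the snippet below fits two of the distributions listed above using their closed-form maximum likelihood estimates. The function names, the simulated data, and the true rate of 2.0 are assumptions made for illustration; they do not come from the linked article.

```python
import math
import random
import statistics


def fit_exponential_mle(samples):
    """MLE of the exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(samples)


def fit_gaussian_mle(samples):
    """MLE of a Gaussian: sample mean and the divide-by-n standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma


def exponential_log_likelihood(samples, rate):
    """Log-likelihood of the samples under an exponential with the given rate."""
    return sum(math.log(rate) - rate * x for x in samples)


# Simulated data with an assumed true rate of 2.0 (illustrative only).
rng = random.Random(0)
data = [rng.expovariate(2.0) for _ in range(5000)]

rate_hat = fit_exponential_mle(data)
mu_hat, sigma_hat = fit_gaussian_mle(data)
print("estimated exponential rate:", rate_hat)
print("Gaussian fit to the same data:", mu_hat, sigma_hat)
```

For distributions without a closed-form estimate, the same log-likelihood function would be handed to a numerical optimizer instead.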

Classical hypothesis testing. I think p-values and frequentist hypothesis testing

Markov chains + bells + whistles.

Basic Bayesian thinking & modeling. Learn to think of everything as a probability distribution instead of just a single value (if appropriate). Be able to assemble the models & compute with them.
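One way to picture "a distribution instead of a single value" is the Beta-Binomial update, sketched below. The uniform Beta(1, 1) prior and the counts of 7 successes and 3 failures are invented for illustration and are not from the linked article.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta) / (total ** 2 * (total + 1))


# Assumed prior: uniform over the unknown success probability.
prior = (1.0, 1.0)

# Hypothetical data: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)
posterior_mean = beta_mean(*posterior)
posterior_var = beta_variance(*posterior)
print("posterior parameters:", posterior)
print("posterior mean and variance:", posterior_mean, posterior_var)
```

The unknown probability is carried around as the pair of Beta parameters, so the uncertainty shrinks as more data arrives rather than being discarded in a point estimate.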

Some old-school stats and probability theory. E.g. “Random variables; transformations, conditional expectation, moment generating functions, convergence, limit theorems, estimation; Cramér-Rao lower bound, maximum likelihood estimation, sufficiency, ancillarity, completeness. Rao-Blackwell theorem. Some decision theory.”

Regression! First linear, then nonlinear. (Gasp!)

Machine learning. I know you said “statistics,” but really if you want to be a “data scientist” then machine learning



We have been comfortable with L2 optimization (Euclidean distance metric) for a long time but there is a groundswell of activity in L1 optimization (taxicab distance metric). L1 optimization pushes us out of our comfort zone of mean-squared error optimality and associated 2nd-order thinking!
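A small sketch of the difference, using the simplest possible problem: choosing one number to summarize a list. The mean minimizes the L2 (squared) loss and the median minimizes the L1 (absolute) loss. The sample values, including the outlier of 100, are made up for illustration.

```python
import statistics


def l1_loss(center, values):
    """Sum of absolute deviations (taxicab distance)."""
    return sum(abs(v - center) for v in values)


def l2_loss(center, values):
    """Sum of squared deviations (squared Euclidean distance)."""
    return sum((v - center) ** 2 for v in values)


# Hypothetical data with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_center = statistics.fmean(values)      # minimizer of the L2 loss
median_center = statistics.median(values)   # a minimizer of the L1 loss

print("mean:", mean_center, "median:", median_center)
print("L1 loss at mean vs median:", l1_loss(mean_center, values), l1_loss(median_center, values))
print("L2 loss at mean vs median:", l2_loss(mean_center, values), l2_loss(median_center, values))
```

The outlier drags the mean far from most of the data while the median barely moves, which is one reason L1 objectives are described as more robust.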

The highlight is the collection of the vast material under 3 topics: Bayes Theorem, Cover Theorem and Neuroscience & ad hoc methods. In ML practice, these ML methods are “wrapped” around by “bootstrap” and “consensus” methods.

Cover Theorem

Estimating the Conditional Expectation,

Perceptrons

Input side: Bootstrap methods
The objective is to maximize Training Set information use. 
Feature subspace.
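A rough sketch of the two input-side ideas, assuming a training set of 10 rows and 6 features (both numbers are arbitrary): a bootstrap sample draws row indices with replacement, and a feature subspace keeps a random subset of the columns. The helper names are hypothetical.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subspace(n_features, k, rng):
    """Pick k distinct feature indices without replacement."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)

rows = bootstrap_indices(10, rng)               # rows one learner would train on
features = random_feature_subspace(6, 3, rng)   # columns that learner would see

# Rows never drawn are "out of bag" and can serve as a built-in validation set.
out_of_bag = sorted(set(range(10)) - set(rows))

print("bootstrap rows:", rows)
print("feature subspace:", features)
print("out-of-bag rows:", out_of_bag)
```

Each learner in an ensemble would get its own draw, so the same training set is reused many times in slightly different forms.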

Output side: Consensus methods
Solve the problem using independent ML methods and combine the results. 
 Combine “weak learners”.
 Random Forest.
 AdaBoost.
 And many more.
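The simplest consensus rule is an unweighted majority vote, sketched below. This is not Random Forest or AdaBoost themselves (AdaBoost, for instance, weights its learners); the three threshold rules are hypothetical weak learners invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical weak learners: each applies one crude threshold rule.
weak_learners = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.0 else 0,
]


def ensemble_predict(x):
    """Combine the weak learners by unweighted majority vote."""
    return majority_vote([learner(x) for learner in weak_learners])


for point in [(0.9, 0.2), (0.2, 0.3), (0.6, 0.3)]:
    print(point, "->", ensemble_predict(point))
```

With an odd number of learners and two labels a tie cannot occur; a weighted scheme would need an explicit tie-breaking rule.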

These are different learning algorithms
 Combine “weak learners”.


Creating Plots with Python and Plotly  WIRED

Lines 11–12: These are two empty lists. In order to make a plot, I need to give plotly a list of values. As I go through each step in the calculation, I will add a value to the list.
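The pattern described here can be sketched without plotly at all. The calculation in the original article is not reproduced in this excerpt, so the falling-ball update below, along with every variable name and constant, is an assumption chosen only to show the two lists being filled step by step.

```python
# Two empty lists: one for the horizontal axis, one for the vertical axis.
t_values = []
y_values = []

# Hypothetical calculation: a ball dropped from 10 m, stepped forward in time.
t, y, v = 0.0, 10.0, 0.0
dt, g = 0.1, 9.8

while y > 0:
    # Record the current state before advancing to the next step.
    t_values.append(t)
    y_values.append(y)

    v -= g * dt   # update the velocity
    y += v * dt   # update the position
    t += dt       # advance the clock

print(len(t_values), "points collected")
```

Once the loop ends, the two lists hold matching x and y data, which is the form a plotting library expects.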


Test drive — Conda documentation

Conda is both a package manager and an environment manager

But let’s say that you want to use a package that requires a different version of Python than you are currently using

2. Managing environments

TIP: Many frequently used options after two dashes (--) can be abbreviated with just a dash and the first letter. So --name and -n options are the same and --envs and -e are the same. See conda --help or conda -h for a list of abbreviations.

TIP: You can add much more to the conda create command; type conda create --help for details.

conda info --envs

 Windows: activate snowflakes

Environments are installed by default into the envs directory in your conda directory. You can specify a different path; see conda create --help for details.

Which of these environments are you using right now – snowflakes or bunnies? To find out, type the same command:

NOTE: conda also puts an asterisk (*) in front of the active environment in your environment list; see above in “List all environments.”

View a list of packages and versions installed in an environment

TIP: Pip is only a package manager, so it cannot manage environments for you. Pip cannot even update Python, because unlike conda it does not consider Python a package. But it does install some things that conda does not, and vice versa. Both pip and conda are included in Anaconda and Miniconda.


5 EBooks to Read Before Getting into A Data Science or Big Data Career

What Statistics Topics are Needed for Excelling at Data Science?


The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Code can produce rich output such as images, videos, LaTeX, and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time.

Leverage big data tools, such as Apache Spark, from Python, R and Scala. Explore that same data with pandas, scikit-learn, ggplot2, dplyr, etc.


IPython – Wikipedia, the free encyclopedia

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers introspection, rich media, shell syntax, tab completion, and history.

An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media.
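To make the "JSON document containing an ordered list of cells" description concrete, here is a simplified sketch that builds a two-cell notebook as a plain dictionary and round-trips it through JSON. The field names follow the version 4 notebook format as commonly documented, but treat them as an assumption and check the nbformat specification before relying on them; the cell contents are invented.

```python
import json

# A minimal notebook-like structure: top-level metadata plus an ordered list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Demo notebook",
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(1 + 1)",
        },
    ],
}

# Serialize to JSON text (what would be stored on disk) and read it back.
text = json.dumps(notebook, indent=1)
restored = json.loads(text)

cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Because the file is ordinary JSON, any language with a JSON parser can read or generate notebooks, and cell order is simply list order.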

IPython Notebook provides a browser-based REPL built upon a number of popular Open Source libraries:

IPython Notebook was added to IPython in the 0.12 release (December 2011). IPython Notebook has been compared to Maple, Mathematica, and Sage.
IPython notebooks frequently draw from SciPy stack libraries like NumPy and SciPy, often installed along with IPython from one of many Scientific Python distributions.

Project Jupyter
In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter. IPython will continue to exist as a Python shell and a kernel for Jupyter, while the notebook and other language-agnostic parts of IPython will move under the Jupyter name. Jupyter added support for J


Docker: Data Science Environment with Jupyter

If you want to install your own packages inside the container, you can get into it and run any normal bash shell commands. In order to get into a container, you’ll need to run docker exec. Docker exec takes a specific container id, and a command to run. For instance, typing docker exec -it 4greg24134 /bin/bash will open a shell prompt in the container with id 4greg24134. The -it flags ensure that we keep an input session open with the container, and can enter commands.

Posted from Diigo. The rest of my favorite links are here.