Data Analysis with Open Source Tools (English, Paperback, Janert Philipp K.)
Quick Overview
Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications. Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve -- rather than rely on tools to think for you.

Use graphics to describe data with one, two, or dozens of variables
Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
Mine data with computationally intensive methods such as simulation and clustering
Make your conclusions understandable through reports, dashboards, and other metrics programs
Understand financial calculations, including the time-value of money
Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
Become familiar with different open source programming environments for data analysis

About the Author

After previous careers in physics and software development, Philipp K. Janert currently provides consulting services for data analysis, algorithm development, and mathematical modeling. He has worked for small start-ups and in large corporate environments, both in the U.S. and overseas. He prefers simple solutions that work to complicated ones that don't, and thinks that purpose is more important than process.
Philipp is the author of "Gnuplot in Action - Understanding Data with Graphs" (Manning Publications), and has written for the O'Reilly Network, IBM developerWorks, and IEEE Software. He is named inventor on a handful of patents, and is an occasional contributor to CPAN. He holds a Ph.D. in theoretical physics from the University of Washington. Visit his company website at www.principal-value.com.

Table of Contents

Chapter 1 Introduction
  Data Analysis
  What’s in This Book
  What’s with the Workshops?
  What’s with the Math?
  What You’ll Need
  What’s Missing

Graphics: Looking at Data

Chapter 2 A Single Variable: Shape and Distribution
  Dot and Jitter Plots
  Histograms and Kernel Density Estimates
  The Cumulative Distribution Function
  Rank-Order Plots and Lift Charts
  Only When Appropriate: Summary Statistics and Box Plots
  Workshop: NumPy
  Further Reading

Chapter 3 Two Variables: Establishing Relationships
  Scatter Plots
  Conquering Noise: Smoothing
  Logarithmic Plots
  Banking
  Linear Regression and All That
  Showing What’s Important
  Graphical Analysis and Presentation Graphics
  Workshop: matplotlib
  Further Reading

Chapter 4 Time As a Variable: Time-Series Analysis
  Examples
  The Task
  Smoothing
  Don’t Overlook the Obvious!
  The Correlation Function
  Optional: Filters and Convolutions
  Workshop: scipy.signal
  Further Reading

Chapter 5 More Than Two Variables: Graphical Multivariate Analysis
  False-Color Plots
  A Lot at a Glance: Multiplots
  Composition Problems
  Novel Plot Types
  Interactive Explorations
  Workshop: Tools for Multivariate Graphics
  Further Reading

Chapter 6 Intermezzo: A Data Analysis Session
  A Data Analysis Session
  Workshop: gnuplot
  Further Reading

Analytics: Modeling Data

Chapter 7 Guesstimation and the Back of the Envelope
  Principles of Guesstimation
  How Good Are Those Numbers?
  Optional: A Closer Look at Perturbation Theory and Error Propagation
  Workshop: The Gnu Scientific Library (GSL)
  Further Reading

Chapter 8 Models from Scaling Arguments
  Models
  Arguments from Scale
  Mean-Field Approximations
  Common Time-Evolution Scenarios
  Case Study: How Many Servers Are Best?
  Why Modeling?
  Workshop: Sage
  Further Reading

Chapter 9 Arguments from Probability Models
  The Binomial Distribution and Bernoulli Trials
  The Gaussian Distribution and the Central Limit Theorem
  Power-Law Distributions and Non-Normal Statistics
  Other Distributions
  Optional: Case Study—Unique Visitors over Time
  Workshop: Power-Law Distributions
  Further Reading

Chapter 10 What You Really Need to Know About Classical Statistics
  Genesis
  Statistics Defined
  Statistics Explained
  Controlled Experiments Versus Observational Studies
  Optional: Bayesian Statistics—The Other Point of View
  Workshop: R
  Further Reading

Chapter 11 Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
  How to Average Averages
  The Standard Deviation
  Least Squares
  Further Reading

Computation: Mining Data

Chapter 12 Simulations
  A Warm-Up Question
  Monte Carlo Simulations
  Resampling Methods
  Workshop: Discrete Event Simulations with SimPy
  Further Reading

Chapter 13 Finding Clusters
  What Constitutes a Cluster?
  Distance and Similarity Measures
  Clustering Methods
  Pre- and Postprocessing
  Other Thoughts
  A Special Case: Market Basket Analysis
  A Word of Warning
  Workshop: Pycluster and the C Clustering Library
  Further Reading

Chapter 14 Seeing the Forest for the Trees: Finding Important Attributes
  Principal Component Analysis
  Visual Techniques
  Kohonen Maps
  Workshop: PCA with R
  Further Reading

Chapter 15 Intermezzo: When More Is Different
  A Horror Story
  Some Suggestions
  What About Map/Reduce?
  Workshop: Generating Permutations
  Further Reading

Applications: Using Data

Chapter 16 Reporting, Business Intelligence, and Dashboards
  Business Intelligence
  Corporate Metrics and Dashboards
  Data Quality Issues
  Workshop: Berkeley DB and SQLite
  Further Reading

Chapter 17 Financial Calculations and Modeling
  The Time Value of Money
  Uncertainty in Planning and Opportunity Costs
  Cost Concepts and Depreciation
  Should You Care?
  Is This All That Matters?
  Workshop: The Newsvendor Problem
  Further Reading

Chapter 18 Predictive Analytics
  Topics in Predictive Analytics
  Some Classification Terminology
  Algorithms for Classification
  The Process
  The Secret Sauce
  The Nature of Statistical Learning
  Workshop: Two Do-It-Yourself Classifiers
  Further Reading

Chapter 19 Epilogue: Facts Are Not Reality

Appendix Programming Environments for Scientific Computation and Data Analysis
  Software Tools
  A Catalog of Scientific Software
  Writing Your Own
  Further Reading

Appendix Results from Calculus
  Common Functions
  Calculus
  Useful Tricks
  Notation and Basic Math
  Where to Go from Here
  Further Reading

Appendix Working with Data
  Sources for Data
  Cleaning and Conditioning
  Sampling
  Data File Formats
  The Care and Feeding of Your Data Zoo
  Skills
  Terminology
  Further Reading

Appendix About the Author

Colophon