Scrape, Clean, Explore, and Transform Your Data
Kyran Dale

#Python
#JavaScript
#Data_Visualization
#API
#Scrapy
#NumPy
#Jupyter
#pandas
#Matplotlib
#Seaborn
#RESTful_API
#HTML
#CSS
How do you turn raw, unprocessed, or malformed data into dynamic, interactive web visualizations? In this practical book, author Kyran Dale shows data scientists and analysts, as well as Python and JavaScript developers, how to create the ideal toolchain for the job. By providing engaging examples and stressing hard-earned best practices, this guide teaches you how to leverage the power of best-of-breed Python and JavaScript libraries.
Python provides accessible, powerful, and mature libraries for scraping, cleaning, and processing data. And while JavaScript is the best language when it comes to programming web visualizations, its data processing abilities can't compare with Python's. Together, these two languages are a perfect complement for creating a modern web-visualization toolchain. This book gets you started.
You'll learn how to:
The chief ambition of this book is to describe a data visualization (dataviz) toolchain that, in the era of the internet, is starting to predominate. The guiding principle of this toolchain is that whatever insightful nuggets you have managed to mine from your data deserve a home in the web browser. Being on the web means you can easily choose to distribute your dataviz to a select few (using authentication or restricting to a local network) or the whole world. This is the big idea of the internet and one that dataviz is embracing at a rapid pace. And that means that the future of dataviz involves JavaScript, the only first-class language of the web browser. But JavaScript does not yet have the data-processing stack needed to refine raw data, which means data visualization is inevitably a multilanguage affair. I hope this book provides support for my belief that Python is the natural complement to JavaScript’s monopoly of browser visualizations.
Although this book is a big one (that fact is felt most keenly by the author right now), it has had to be very selective, leaving out a lot of really cool Python and JavaScript dataviz tools and focusing on the ones that provide the best building blocks. The number of helpful libraries I couldn’t cover reflects the enormous vitality of the Python and JavaScript data science ecosystems. Even while the book was being written, brilliant new Python and JavaScript libraries were being introduced, and the pace continues.
All data visualization is essentially transformative, and showing the journey from one reflection of a dataset (HTML tables and lists) to a more modern, engaging, interactive, and, fundamentally, browser-based one provides a good way to introduce key data visualization tools in a working context. The challenge is to transform a basic Wikipedia list of Nobel Prize winners into a modern, interactive, browser-based visualization. Thus, the same dataset is presented in a more accessible, engaging form.
The journey from unprocessed data to a fairly rich, user-driven visualization informs the choice of best-of-breed tools. First, we need to get our dataset. Often this is provided by a colleague or client, but to increase the challenge and learn some pretty vital dataviz skills along the way, we learn how to scrape the dataset from the web (Wikipedia’s Nobel Prize pages) using Python’s powerful Scrapy library. This unprocessed dataset then needs to be refined and explored, and there isn’t a much better ecosystem for this than Python’s pandas. Along with Matplotlib in support and driven by a Jupyter notebook, pandas is becoming the gold standard for this kind of forensic data work. With clean data stored (to SQL with SQLAlchemy and SQLite) and explored, the cherry-picked data stories can be visualized. I cover the use of Matplotlib and Plotly to embed static and dynamic charts from Python in a web page. But for something more ambitious, the supreme dataviz library for the web is the JavaScript-based D3. We cover the essentials of D3 while using them to produce our showpiece Nobel data visualization.
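To give a rough feel for the clean, store, and explore steps in that chain, here is a minimal sketch. It is not the book's code: to stay dependency-free it swaps in Python's built-in sqlite3 module and plain dictionaries where the book uses pandas and SQLAlchemy, and the sample rows, the `clean` and `store` helpers, and the `winners` table name are all hypothetical placeholders rather than real scraped data.

```python
import sqlite3

# Hypothetical, hand-written sample rows standing in for scraped data;
# they are placeholders, not real scraper output.
raw_rows = [
    {"name": " Winner A ", "category": "physics", "year": "1901"},
    {"name": "Winner B", "category": "Chemistry ", "year": "1902"},
    {"name": "Winner B", "category": "Chemistry", "year": "1902"},  # duplicate
    {"name": "", "category": "Peace", "year": "n/a"},  # unusable row
]


def clean(rows):
    """Strip whitespace, normalize case, drop unusable rows and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        category = row["category"].strip().title()
        year = row["year"].strip()
        if not name or not year.isdigit():
            continue  # skip rows missing a name or a numeric year
        record = (name, category, int(year))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def store(records, conn):
    """Write cleaned records to a SQLite table named 'winners' (assumed name)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS winners "
        "(name TEXT, category TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO winners VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
store(clean(raw_rows), conn)

# Explore: count rows per category, the kind of summary a chart might show.
counts = conn.execute(
    "SELECT category, COUNT(*) FROM winners GROUP BY category ORDER BY category"
).fetchall()
print(counts)
```

The point of the sketch is the shape of the pipeline: messy rows go in, a small cleaning function normalizes and filters them, and the result lands in a SQL table that can be queried for the stories worth visualizing.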
This book is a collection of tools forming a chain, with the creation of the Nobel visualization providing a guiding narrative. You should be able to dip into relevant chapters when and if the need arises; the different parts of the book are self-contained so you can quickly review what you’ve learned when required.
This book is divided into five parts. The first part introduces a basic Python and JavaScript dataviz toolkit, while the next four show how to retrieve raw data, clean it, explore it, and finally transform it into a modern web visualization.
Kyran Dale is a jobbing programmer, ex-research scientist, recreational hacker, independent researcher, occasional entrepreneur, cross-country runner, and improving jazz pianist. During 15-odd years as a research scientist he hacked a lot of code, learned a lot of libraries, and settled on some favorite tools. These days he finds Python, JavaScript, and a little C++ go a long way toward solving most problems out there. He specializes in fast prototyping and feasibility studies, with an algorithmic bent, but is happy to just build cool things.









