Data Extraction from the Modern Web
Ryan Mitchell
If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.
Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.
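As a rough illustration of the mechanics described above (request a page, handle the response, parse out the data), here is a minimal sketch that uses only the Python standard library. It is not taken from the book, which may rely on different libraries. The names `TitleAndLinkParser`, `parse_page`, `fetch`, and `SAMPLE_HTML` are hypothetical and exist only for this example. The parsing step runs against an inline HTML string so that the example works without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def parse_page(html):
    """Return (title, list of link targets) for an HTML string."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


def fetch(url):
    """Request a page and decode the response body (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Assumed sample input, standing in for a real server response.
SAMPLE_HTML = """<html><head><title>Example Page</title></head>
<body><a href="/page1">One</a> <a href="/page2">Two</a> <a>no href</a></body></html>"""

title, links = parse_page(SAMPLE_HTML)
print(title)
print(links)
```

To try this against a live site, pass the result of `fetch(url)` to `parse_page` in place of `SAMPLE_HTML`, using a URL you are permitted to scrape.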
Table of Contents
Chapter 1. How the Internet Works
Chapter 2. The Legalities and Ethics of Web Scraping
Chapter 3. Applications of Web Scraping
Chapter 4. Writing Your First Web Scraper
Chapter 5. Advanced HTML Parsing
Chapter 6. Writing Web Crawlers
Chapter 7. Web Crawling Models
Chapter 8. Scrapy
Chapter 9. Storing Data
Chapter 10. Reading Documents
Chapter 11. Working with Dirty Data
Chapter 12. Reading and Writing Natural Languages
Chapter 13. Crawling Through Forms and Logins
Chapter 14. Scraping JavaScript
Chapter 15. Crawling Through APIs
Chapter 16. Image Processing and Text Recognition
Chapter 17. Avoiding Scraping Traps
Chapter 18. Testing Your Website with Scrapers
Chapter 19. Web Scraping in Parallel
Chapter 20. Web Scraping Proxies
This book is designed to serve not only as an introduction to web scraping but also as a comprehensive guide to collecting, transforming, and using data from uncooperative sources. Although it uses the Python programming language and covers many Python basics, it should not be used as an introduction to the language.
If you don’t know any Python at all, this book might be a bit of a challenge. With that said, I’ve tried to keep all concepts and code samples at a beginning-to-intermediate Python programming level in order to make the content accessible to a wide range of readers. To this end, there are occasional explanations of more advanced Python programming and general computer science topics where appropriate. If you are a more advanced reader, feel free to skim these parts!
If you’re looking for a more comprehensive Python resource, Introducing Python by Bill Lubanovic (O’Reilly) is a good, if lengthy, guide. For those with shorter attention spans, the video series Introduction to Python by Jessica McKellar (O’Reilly) is an excellent resource. I’ve also enjoyed Think Python by a former professor of mine, Allen Downey (O’Reilly). This last book in particular is ideal for those new to programming, and teaches computer science and software engineering concepts along with the Python language.
Technical books often focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of “data gathering.” It should not be used as a complete treatment of any of these subjects, but I believe they are covered in enough detail to get you started writing web scrapers!
Part I covers the subject of web scraping and web crawling in depth, with a strong focus on a small handful of libraries used throughout the book. Part I can easily be used as a comprehensive reference for these libraries and techniques (with certain exceptions, where additional references will be provided). The skills taught in the first part will likely be useful for everyone writing a web scraper, regardless of their particular target or application.
Part II covers additional subjects that the reader might find useful when writing web scrapers, but that might not be useful for all scrapers all the time. These subjects are, unfortunately, too broad to be neatly wrapped up in a single chapter. Because of this, frequent references are made to other resources for additional information.
The structure of this book enables you to easily jump around among chapters to find only the web scraping technique or information that you are looking for. When a concept or piece of code builds on another mentioned in a previous chapter, I explicitly reference the section in which it was addressed.
About the Author
Ryan Mitchell is a senior software engineer at GLG, as well as a speaker and author.
An expert in web scraping, web security, and data science, Ryan has hosted workshops and spoken at many events, including Data Day and DEF CON. She has also taught web programming and data science and consulted on coursework at a variety of institutions. Ryan holds a master's degree in software engineering from Harvard University Extension School. At GLG, she creates data analysis tools. Ryan is the author of Web Scraping with Python (O'Reilly), as well as Instant Web Scraping with Java (Packt Publishing).