Google bigquery the definitive guide: data warehousing, analytics, and machine learning at scale
Publication details: Mumbai O'reilly/Shroff Publishers &Distributors Pvt Ltd 2019Description: xvii, 475 PaperISBN:- 978-93-5213-921-7
- 006.3/Lak/Tig
Item type | Current library | Call number | Status | Date due | Barcode | Item holds | |
---|---|---|---|---|---|---|---|
Book | Main Library | 006.3/Lak/Tig/36989 (Browse shelf(Opens below)) | Available | 11136989 |
Table of Contents
Preface
Who Is This Book For?
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. What Is Google BigQuery?
Data Processing Architectures
Relational Database Management System
MapReduce Framework
BigQuery: A Serverless, Distributed SQL Engine
Working with BigQuery
Deriving Insights Across Datasets
ETL, EL, and ELT
Powerful Analytics
Simplicity of Management
How BigQuery Came About
What Makes BigQuery Possible?
Separation of Compute and Storage
Storage and Networking Infrastructure
Managed Storage
Integration with Google Cloud Platform
Security and Compliance
Summary
2. Query Essentials
Simple Queries
Retrieving Rows by Using SELECT
Aliasing Column Names with AS
Filtering with WHERE
SELECT *, EXCEPT, REPLACE
Subqueries with WITH
Sorting with ORDER BY
Aggregates
Computing Aggregates by Using GROUP BY
Counting Records by Using COUNT
Filtering Grouped Items by Using HAVING
Finding Unique Values by Using DISTINCT
A Brief Primer on Arrays and Structs
Creating Arrays by Using ARRAY_AGG
Array of STRUCT
TUPLE
Working with Arrays
UNNEST an Array
Joining Tables
The JOIN Explained
INNER JOIN
CROSS JOIN
OUTER JOIN
Saving and Sharing
Query History and Caching
Saved Queries
Views Versus Shared Queries
Summary
3. Data Types, Functions, and Operators
Numeric Types and Functions
Mathematical Functions
Standard-Compliant Floating-Point Division
SAFE Functions
Comparisons
Precise Decimal Calculations with NUMERIC
Working with BOOL
Logical Operations
Conditional Expressions
Cleaner NULL-Handling with COALESCE
Casting and Coercion
Using COUNTIF to Avoid Casting Booleans
String Functions
Internationalization
Printing and Parsing
String Manipulation Functions
Transformation Functions
Regular Expressions
Summary of String Functions
Working with TIMESTAMP
Parsing and Formatting Timestamps
Extracting Calendar Parts
Arithmetic with Timestamps
Date, Time, and DateTime
Working with GIS Functions
Summary
4. Loading Data into BigQuery
The Basics
Loading from a Local Source
Specifying a Schema
Copying into a New Table
Data Management (DDL and DML)
Loading Data Efficiently
Federated Queries and External Data Sources
How to Use Federated Queries
When to Use Federated Queries and External Data Sources
Interactive Exploration and Querying of Data in Google Sheets
SQL Queries on Data in Cloud Bigtable
Transfers and Exports
Data Transfer Service
Exporting Stackdriver Logs
Using Cloud Dataflow to Read/Write from BigQuery
Moving On-Premises Data
Data Migration Methods
Summary
5. Developing with BigQuery
Developing Programmatically
Accessing BigQuery via the REST API
Google Cloud Client Library
Accessing BigQuery from Data Science Tools
Notebooks on Google Cloud Platform
Working with BigQuery, pandas, and Jupyter
Working with BigQuery from R
Cloud Dataflow
JDBC/ODBC drivers
Incorporating BigQuery Data into Google Slides (in G Suite)
Bash Scripting with BigQuery
Creating Datasets and Tables
Executing Queries
BigQuery Objects
Summary
6. Architecture of BigQuery
High-Level Architecture
Life of a Query Request
BigQuery Upgrades
Query Engine (Dremel)
Dremel Architecture
Query Execution
Storage
Storage Data
Metadata
Summary
7. Optimizing Performance and Cost
Principles of Performance
Key Drivers of Performance
Controlling Cost
Measuring and Troubleshooting
Measuring Query Speed Using REST API
Measuring Query Speed Using BigQuery Workload Tester
Troubleshooting Workloads Using Stackdriver
Reading Query Plan Information
Increasing Query Speed
Minimizing I/O
Caching the Results of Previous Queries
Performing Efficient Joins
Avoiding Overwhelming a Worker
Using Approximate Aggregation Functions
Optimizing How Data Is Stored and Accessed
Minimizing Network Overhead
Choosing an Efficient Storage Format
Partitioning Tables to Reduce Scan Size
Clustering Tables Based on High-Cardinality Keys
Time-Insensitive Use Cases
Batch Queries
File Loads
Summary
Checklist
8. Advanced Queries
Reusable Queries
Parameterized Queries
SQL User-Defined Functions
Reusing Parts of Queries
Advanced SQL
Working with Arrays
Window Functions
Table Metadata
Data Definition Language and Data Manipulation Language
Beyond SQL
JavaScript UDFs
Scripting
Advanced Functions
BigQuery Geographic Information Systems
Useful Statistical Functions
Hash Algorithms
Summary
9. Machine Learning in BigQuery
What Is Machine Learning?
Formulating a Machine Learning Problem
Types of Machine Learning Problems
Building a Regression Model
Choose the Label
Exploring the Dataset to Find Features
Creating a Training Dataset
Training and Evaluating the Model
Predicting with the Model
Examining Model Weights
More-Complex Regression Models
Building a Classification Model
Training
Evaluation
Prediction
Choosing the Threshold
Customizing BigQuery ML
Controlling Data Split
Balancing Classes
Regularization
k-Means Clustering
What’s Being Clustered?
Clustering Bicycle Stations
Carrying Out Clustering
Understanding the Clusters
Data-Driven Decisions
Recommender Systems
The MovieLens Dataset
Matrix Factorization
Making Recommendations
Incorporating User and Movie Information
Custom Machine Learning Models on GCP
Hyperparameter Tuning
AutoML
Support for TensorFlow
Summary
10. Administering and Securing BigQuery
Infrastructure Security
Identity and Access Management
Identity
Role
Resource
Administering BigQuery
Job Management
Authorizing Users
Restoring Deleted Records and Tables
Continuous Integration/Continuous Deployment
Cost/Billing Exports
Dashboards, Monitoring, and Audit Logging
Availability, Disaster Recovery, and Encryption
Zones, Regions, and Multiregions
BigQuery and Failure Handling
Durability, Backups, and Disaster Recovery
Privacy and Encryption
Regulatory Compliance
Data Locality
Restricting Access to Subsets of Data
Removing All Transactions Related to a Single Individual
Data Loss Prevention
CMEK
Data Exfiltration Protection
Summary
Index
Book Description
Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently.
Valliappa Lakshmanan, tech lead for Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices for modern data warehousing within an autoscaled, serverless public cloud. Whether you want to explore parts of BigQuery you’re not familiar with or prefer to focus on specific tasks, this reference is indispensable.
There are no comments on this title.