What is Data Loading and How Do You Load Data into Google BigQuery?
Data loading is the process of copying data sets from a source file or an application into a cloud data warehouse. You must first load your data into a cloud data warehouse so that data teams can analyse it and provide insights.
Google Cloud BigQuery
In this tutorial you will learn how to load data into a Google cloud data warehouse, namely BigQuery.
BigQuery is a serverless, highly scalable and cost-effective data warehouse with a built-in business intelligence engine and built-in machine learning.
Google's pitch for BigQuery is that you just load your data and let Google Cloud Platform handle all the hard work. BigQuery can be accessed through the Console (web UI) or the command-line tool. Let us see an example of how to load data into BigQuery.
Data loading into BigQuery
You can use bq, the Python-based command-line tool for BigQuery, to load sample data into BigQuery and run queries against the resulting datasets/tables.
Prerequisites: Ensure you have your own GCP account or project for this exercise.
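If you run these commands in Google Cloud Shell, the gcloud and bq tools come pre-installed and already authenticated against your project. If you work from your own machine instead, a minimal setup sketch would look like the following (this assumes the Google Cloud SDK is installed, and uses this tutorial's project ID sneppets-gcp as a stand-in for your own):

# Authenticate and point the SDK at your project
# (sneppets-gcp is this tutorial's project ID; substitute your own).
gcloud auth login
gcloud config set project sneppets-gcp

# Verify the bq command-line tool is available.
bq version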
BigQuery offers a number of sample tables that you can run queries against, or you can create and upload your own dataset. The example below shows the step-by-step process to load data into BigQuery.
Step 1: Create new dataset
Create a new dataset named babynames in your GCP project:
sneppets@cloudshell:~ (sneppets-gcp)$ bq mk babynames
Dataset 'sneppets-gcp:babynames' successfully created.
sneppets@cloudshell:~ (sneppets-gcp)$ bq ls
  datasetId
 -----------
  babynames
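A bare bq mk creates the dataset in the default location. If you want to be explicit about the location, or attach a description, bq mk also accepts those as flags; a sketch (the description text here is just illustrative):

# --location is a global bq flag, so it comes before the mk command.
bq --location=US mk --dataset \
    --description "SSA baby names tutorial data" \
    sneppets-gcp:babynames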
Step 2: Download the data source
In BigQuery, before you load data into a table, you first need a dataset in your project. We are going to use this babynames zip file as the data source; it contains approximately 7 MB of data about popular baby names, provided by the US Social Security Administration (SSA).
First, download the source file to your local directory using wget as shown below.
sneppets@cloudshell:~ (sneppets-gcp)$ wget http://www.ssa.gov/OACT/babynames/names.zip
URL transformed to HTTPS due to an HSTS policy
--2019-11-17 19:04:22--  https://www.ssa.gov/OACT/babynames/names.zip
Resolving www.ssa.gov (www.ssa.gov)... 137.200.4.16, 2001:1930:e03::aaaa
Connecting to www.ssa.gov (www.ssa.gov)|137.200.4.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7200451 (6.9M) [application/zip]
Saving to: ‘names.zip’

names.zip  100%[========================================================================>]   6.87M   282KB/s    in 27s

2019-11-17 19:04:50 (263 KB/s) - ‘names.zip’ saved [7200451/7200451]

sneppets@cloudshell:~ (sneppets-gcp)$ ls
names.zip
Now unzip the file. You should see a big list of files! List them with the ls command.
sneppets@cloudshell:~ (sneppets-gcp)$ unzip names.zip
sneppets@cloudshell:~ (sneppets-gcp)$ ls
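Each yobYYYY.txt file is a headerless CSV with three columns: name, gender and the number of babies given that name in that year. It is worth peeking at the data before loading it; for example, assuming the male rows are sorted by count in descending order (as the SSA files conventionally are), the first one should match the top result of the query we run later:

# Print the first male row of the 2010 file; expected: Jacob,M,22127
grep ',M,' yob2010.txt | head -1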
Step 3: Upload/Load the dataset
This is the most important step. You use the bq load command, which creates or updates a table and loads the data into it in a single step.
The bq load arguments you will be using are:
datasetID: babynames
tableID:   names2010
source:    yob2010.txt
schema:    name:string,gender:string,count:integer
Now create your table using the following command:
sneppets@cloudshell:~ (sneppets-gcp)$ bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer
Upload complete.
Waiting on bqjob_r45d9b3ed11bed690_0000016e799a04a1_1 ... (0s) Current status: DONE
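Here bq load relies on CSV being the default source format, and note that re-running the same command appends the rows again rather than replacing them. A sketch that makes both choices explicit, using standard bq load flags:

# --source_format makes the CSV assumption explicit;
# --replace overwrites the table instead of appending on re-runs.
bq load --source_format=CSV --replace \
    babynames.names2010 yob2010.txt \
    name:string,gender:string,count:integer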
You can run the following command to check whether the table names2010 now appears in your dataset babynames:
sneppets@cloudshell:~ (sneppets-gcp)$ bq ls babynames
   tableId    Type    Labels   Time Partitioning   Clustered Fields
 ----------- ------- -------- ------------------- ------------------
  names2010   TABLE
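If you also want to verify the schema and the number of rows that were loaded, bq show prints the table's metadata:

# Show schema, number of rows and size for the loaded table.
bq show babynames.names2010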
Step 4: Run Queries
Now you can query the data any way you want and check the results. For instance, let us run the following command to return the top five most popular boys' names:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count DESC LIMIT 5"
sneppets@cloudshell:~ (sneppets-gcp)$ bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count DESC LIMIT 5"
Waiting on bqjob_r3a7870c4e97967c1_0000016e799f72b8_1 ... (0s) Current status: DONE
+---------+-------+
|  name   | count |
+---------+-------+
| Jacob   | 22127 |
| Ethan   | 18002 |
| Michael | 17350 |
| Jayden  | 17179 |
| William | 17051 |
+---------+-------+
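Note that bq query runs legacy SQL by default, which is why the plain babynames.names2010 reference works above. If you prefer standard SQL, pass --use_legacy_sql=false; an equivalent sketch (backticks around the table reference are the safe general form in standard SQL, hence the backslash escapes inside the double-quoted shell string):

bq query --use_legacy_sql=false \
  "SELECT name, count
   FROM \`babynames.names2010\`
   WHERE gender = 'M'
   ORDER BY count DESC
   LIMIT 5"

Changing gender = 'M' to 'F' gives you the top five girls' names instead.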
Clean up
You can run the bq rm -r command to remove the babynames dataset and all of its tables from your GCP project:
bq rm -r babynames
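bq rm prompts for confirmation before deleting. In a non-interactive script you can add the -f (force) flag to skip the prompt; use it with care, since the deletion happens immediately:

# Remove the dataset and its tables without a confirmation prompt.
bq rm -r -f babynames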
Further Learning
- Google Virtual Private Cloud (VPC) Networks Lab Exercise
- Google Compute Engine Introduction and Features
- Deploy an Application to Kubernetes running on Google Cloud Kubernetes Engine (GKE)
- GCP Virtual Private Network (VPN) Tunnels Lab Example