Load BigQuery Data in Google Colaboratory

If you are familiar with BigQuery, you already know that it is a really powerful tool for data analysis. However, you might find yourself writing multiple versions of the same queries and having to save them all in BigQuery. With Google Colaboratory you can still use your familiar SQL queries, while adding a more organised structure to your analysis. I will explain what Google Colaboratory is and how you can leverage it to load BigQuery data.


Google Colaboratory

In my opinion, the most annoying part of coding is the setup of your “environment”. The very first time you start coding, you need to install the latest version of the language itself. And even when you have written many scripts, you still have to make sure all the right modules are installed. It is best practice to set up an “environment” in which you only install the required packages/modules. With Google Colabs you can run Python code on Google’s servers. In technical terms, you are running your script on a virtual machine. In other words, you are running your script on another computer, which is owned by Google and is what we call “in the cloud”.

The best part is that you do not need to download anything to your own computer and can use this virtual machine instantly. You can easily install any module on that machine, without having to worry about cluttering your own environment. The scripts run in a so-called Jupyter Notebook. These Jupyter Notebook files are saved within your Drive and you can easily share them with anyone. As the name suggests, you can consider this a notebook where you can add different sections of code, thereby making your code well-structured and easy to understand. That is why I really like using Google Colaboratory to load BigQuery data.
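For example, installing a module takes a single line in a notebook cell. A minimal sketch (Seaborn is just an illustration here; many common data modules already come pre-installed on Colab):

# The leading "!" runs a shell command on the Colab virtual machine
!pip install seaborn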

BigQuery

Google offers a way to store large quantities of data in BigQuery and facilitates fast retrieval using SQL. You can create your SQL queries in the BigQuery interface, but it has its limitations. For starters, there are no data visualisation options in BigQuery. Moreover, BigQuery does not allow any dynamic variables to be passed, and recursive approaches are complicated. Therefore, I like to use Google Colabs, as I can use modules like Pandas, Seaborn, and Numpy to perform data analysis.

How to load BigQuery data in Google Colabs

Both BigQuery and Colabs are owned by Google, so an easy integration between the two tools exists. You can create a new Colab notebook by going to Google Drive and selecting Google Colaboratory under the New button. Alternatively, you can open this notebook and copy it to your own drive. The Python code for loading BigQuery starts with authentication and is followed by a so-called “magic” command (identified by the double percent signs %%). You need to run these two blocks of code in separate cells to prevent any errors.

from google.colab import auth

# Opens a login flow so this notebook can access your Google account
auth.authenticate_user()

When you execute the auth.authenticate_user() code in your notebook by pressing SHIFT + ENTER, the notebook will open a new tab in your browser that allows you to log in. You are then authorising the notebook to connect to the BigQuery projects associated with your Google account.

%%bigquery --project projectid
SELECT 
  COUNT(*) as total_rows
FROM `dataset.table`

The magic command (%%) lets the notebook know you are about to load BigQuery data; everything below the first line of the cell is treated as SQL. As mentioned at the beginning of the article, you are also able to use Python variables in your SQL query. The magic command itself does not interpolate Python variables, so for this you switch to the BigQuery client library and build the query as a Python string. The query is surrounded by three quotes (“””). Normally, within Python, you use these three quotes for documentation text. However, in this situation, you use them to write a multi-line SQL query inside Python and fill in the variables with .format(). You can find an example of how to do this below:

from google.cloud import bigquery

# Replace projectid with your own project ID
client = bigquery.Client(project='projectid')

dataset = 'samples'
table = 'shakespeare'

client.query("""
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.{0}.{1}`
""".format(dataset, table)).to_dataframe()
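If you only want to pass values (rather than dataset or table names), you can also stay within the magic command. This sketch assumes the --params flag of the BigQuery magic, which takes a Python dictionary and exposes each entry as an @name parameter in the SQL. First define the parameters in one cell:

params = {'min_word_count': 10}

Then run the query in a separate cell (remember, the magic has to be on the first line):

%%bigquery --project projectid --params $params
SELECT 
  word,
  word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE word_count >= @min_word_count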

You can also save the output to a dataframe instead of displaying it immediately. I prefer this approach, as it allows you to further transform the data or load it into a data visualisation module such as Seaborn. The next lines of code save the output in a variable called “df”, simply by adding the variable name to the %%bigquery line.

%%bigquery --project projectid df
SELECT 
  COUNT(*) as total_rows
FROM `dataset.table`
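Because df is a regular Pandas DataFrame, you can inspect and transform it like any other. A minimal sketch (it assumes you ran the cell above; the Seaborn line uses illustrative column names):

import seaborn as sns

# df was created by the %%bigquery cell above
print(type(df))  # pandas.core.frame.DataFrame
df.head()

# With a query that returns several rows, you could plot the result directly, e.g.:
# sns.barplot(x='word', y='word_count', data=df)  # column names are illustrative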

I hope you enjoyed learning how to load BigQuery data using Google Colaboratory. Do not hesitate to comment below with any questions. And if you want to know more about automating with Python, do not forget to check out my other Python articles.
