Using Google Cloud Platform for data analysis
I must admit I have been reluctant to use cloud computing for data analysis, and to this day I would rather use Excel for quick data exploration. There’s something about a pivot table and conditional formatting that unlocks the right side of my brain and gets the analytical thoughts flowing. Sadly, Excel is not built to handle and process big data, and after the nth workbook crash I often find myself turning to cloud computing. And you know what, it’s pretty good.
As a long-time Google Analytics user, I look to Google Cloud Platform for ease of adoption and familiarity with its UI. It helps that GCP has developed several powerful big data tools that handle, process and analyse large volumes of data whilst resolving many of the challenges data analysts face at each step of the analysis process.
Most data analysis projects start with extracting data from one or more data sources, which Excel handles pretty well with VLOOKUP and INDEX MATCH formulas. However, with a limit of just over 1M rows per worksheet, this can restrict your analysis to a sample of the total data and result in very long processing times. Other challenges arise when working with high-velocity data or data that needs several transformations.
BigQuery has proved to be a useful alternative. It’s a fully managed data warehouse, which makes it easy to get started: you don’t need to deploy any resources such as disks or virtual machines. Since Google manages the database architecture and maintenance, analysts are free to focus on exploring data and deriving insights rather than on infrastructure. It also supports data collection from a large number of data sources.
Once the data extraction is completed, the next step to tackle is cleaning and transforming the data into a format ready for analysis. This data preparation step can prove to be slow depending on the shape and size of the data, and the transformation logic is often opaque, which slows exploration further.
Cloud Dataprep is a free data transformation tool created by Google partner Trifacta that is integrated with other GCP tools. It interacts with a sample of your data and provides a top-line description of each column, such as the distribution of its values. It also has built-in anomaly detection that highlights rows with missing values or datatype mismatches.
Where multiple data sources are used, it tries to detect possible joins and schemas. This no doubt reduces the time it takes to complete this phase. However, as with any automated, “intelligent” tool, limitations and misinterpretations can occur, which is why it’s always worth reviewing the data.
Once the processing rules are defined, the tool gives you the option to export the flow or schedule it to be executed via Google Cloud Dataflow. This ability to create an automated ETL pipeline cuts the latency between raw data arriving and analysis-ready data being available. Better yet, the transformed dataset can be exported directly into BigQuery tables.
After ETL comes data analysis and, as I mentioned earlier, when working with large volumes of data, querying the data in Excel or on a local drive can be extremely slow. BigQuery is a good alternative: it runs queries using highly parallel processing, so even scans of very large tables typically return within seconds. Additionally, results are cached in a temporary table for 24 hours, which speeds up data exploration. Because BigQuery uses standard SQL, it’s relatively intuitive to use. Be warned though: BigQuery SQL comes with its own set of headaches, especially when trying to run complex queries. However, its simplicity and abstraction of infrastructure details make it a good first step into cloud-based analytics.
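To give a flavour of what exploration looks like, here is a sketch of a typical BigQuery standard SQL query. The project, dataset and column names are all placeholders for illustration, not a real dataset:

```sql
-- Hypothetical example: daily sessions and unique users from a
-- web-analytics table. `my_project.analytics.sessions` is a placeholder.
SELECT
  DATE(session_start) AS session_date,
  COUNT(*) AS sessions,
  COUNT(DISTINCT user_id) AS unique_users
FROM `my_project.analytics.sessions`
WHERE session_start >= TIMESTAMP('2018-01-01')
GROUP BY session_date
ORDER BY session_date;
```

Re-running the same query within 24 hours is served from the results cache, which is what makes iterative exploration feel responsive.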
One other limitation is that SQL isn’t built to run advanced statistical analysis like regressions. To do this you’d have to integrate BigQuery with third-party tools such as Tableau, or use its API to export the data into R or Python.
If you wanted to present your findings in an automated report or dashboard, BigQuery integrates easily with visualisation tools such as Google Data Studio and Tableau.
If you have challenges with complex data analyses, no matter what data tools you have, please get in touch and we would be happy to discuss them with you.
UPDATE: Google recently announced a new capability in BigQuery that allows users to create, train, evaluate and run machine learning models using SQL extensions (CREATE MODEL, ML.EVALUATE and ML.PREDICT). It currently features two model types, linear regression and logistic regression, which could prove useful for analysts who want to run predictive analytics tasks.
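A sketch of what that workflow looks like in BigQuery ML — the dataset, model and column names are placeholders, not a real project:

```sql
-- Hypothetical BigQuery ML workflow; `mydataset` and all columns are placeholders.
-- Train a linear regression with `revenue` as the label.
CREATE OR REPLACE MODEL `mydataset.revenue_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['revenue']) AS
SELECT ad_spend, sessions, revenue
FROM `mydataset.training_data`;

-- Evaluate the trained model against held-out data.
SELECT *
FROM ML.EVALUATE(MODEL `mydataset.revenue_model`,
  (SELECT ad_spend, sessions, revenue FROM `mydataset.eval_data`));

-- Generate predictions for new rows.
SELECT *
FROM ML.PREDICT(MODEL `mydataset.revenue_model`,
  (SELECT ad_spend, sessions FROM `mydataset.new_data`));
```

The appeal is that training, evaluation and prediction all happen where the data already lives, with no export step.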