1. Microsoft Excel
Type of tool: Spreadsheet software.
Availability: Commercial.
Mostly used for: Data wrangling and reporting.
Pros: Widely-used, with lots of useful functions and plug-ins.
Cons: Cost, calculation errors, poor at handling big data.
Excel: the world’s best-known spreadsheet software. What’s more, it features calculations and graphing functions that are ideal for data analysis. Whatever your specialism, and no matter what other software you might need, Excel is a staple in the field. Its invaluable built-in features include pivot tables (for sorting or totaling data) and form creation tools. It also has a variety of other functions that streamline data manipulation. For instance, the CONCATENATE function allows you to combine text, numbers, and dates into a single cell. SUMIF lets you create value totals based on variable criteria, and Excel’s search function makes it easy to isolate specific data.
It has limitations though. For instance, it runs very slowly with big datasets and tends to approximate large numbers, leading to inaccuracies. Nevertheless, it’s an important and powerful data analysis tool, and with many plug-ins available, you can easily bypass Excel’s shortcomings. Get started with these ten Excel formulas that all data analysts should know.
2. Python
Type of tool: Programming language.
Availability: Open-source, with thousands of free libraries.
Used for: Everything from data scraping to analysis and reporting.
Pros: Easy to learn, highly versatile, widely-used.
Cons: Memory intensive—doesn’t execute as fast as some other languages.
A programming language with a wide range of uses, Python is a must-have for any data analyst. Unlike more complex languages, it focuses on readability, and its general popularity in the tech field means many programmers are already familiar with it. Python is also extremely versatile; it has a huge range of resource libraries suited to a variety of different data analytics tasks. For example, the NumPy and pandas libraries are great for streamlining highly computational tasks, as well as supporting general data manipulation.
Libraries like Beautiful Soup and Scrapy are used to scrape data from the web, while Matplotlib is excellent for data visualization and reporting. Python’s main drawback is its speed—it is memory intensive and slower than many languages. In general though, if you’re building software from scratch, Python’s benefits far outweigh its drawbacks. You can learn more about Python in this post.
3. R
Type of tool: Programming language.
Availability: Open-source.
Mostly used for: Statistical analysis and data mining.
Pros: Platform independent, highly compatible, lots of packages.
Cons: Slower, less secure, and more complex to learn than Python.
R, like Python, is a popular open-source programming language. It is commonly used to create statistical/data analysis software. R’s syntax is more complex than Python and the learning curve is steeper. However, it was built specifically to deal with heavy statistical computing tasks and is very popular for data visualization. A bit like Python, R also has a network of freely available code, called CRAN (the Comprehensive R Archive Network), which offers 10,000+ packages.
It integrates well with other languages and systems (including big data software) and can call on code from languages like C, C++, and FORTRAN. On the downside, it has poor memory management, and while there is a good community of users to call on for help, R has no dedicated support team. But there is an excellent R-specific integrated development environment (IDE) called RStudio, which is always a bonus!
4. Jupyter Notebook
Type of tool: Interactive authoring software.
Availability: Open-source.
Mostly used for: Sharing code, creating tutorials, presenting work.
Pros: Great for showcasing, language-independent.
Cons: Not self-contained, nor great for collaboration.
Jupyter Notebook is an open-source web application that allows you to create interactive documents. These combine live code, equations, visualizations, and narrative text. Imagine something a bit like a Microsoft word document, only far more interactive, and designed specifically for data analytics! As a data analytics tool, it’s great for showcasing work: Jupyter Notebook runs in the browser and supports over 40 languages, including Python and R. It also integrates with big data analysis tools, like Apache Spark (see below) and offers various outputs from HTML to images, videos, and more.
But as with every tool, it has its limitations. Jupyter Notebook documents have poor version control, and tracking changes is not intuitive. This means it’s not the best place for development and analytics work (you should use a dedicated IDE for these) and it isn’t well suited to collaboration. Since it isn’t self-contained, this also means you have to provide any extra assets (e.g. libraries or runtime systems) to anybody you’re sharing the document with. But for presentation and tutorial purposes, it remains an invaluable data science and data analytics tool.
5. Apache Spark
Type of tool: Data processing framework.
Availability: Open-source.
Mostly used for: Big data processing, machine learning.
Pros: Fast, dynamic, easy to use.
Cons: No file management system, rigid user interface.
Apache Spark is a software framework that allows data analysts and data scientists to quickly process vast data sets. It was first developed in 2012 before being donated to the non-profit Apache Software Foundation. Designed to analyze unstructured big data, Spark distributes computationally heavy analytics tasks across many computers. While other similar frameworks exist (for example, Apache Hadoop) Spark is exceptionally fast. By using RAM rather than local memory, it is around 100x faster than Hadoop. That’s why it’s often used for the development of data-heavy machine learning models.
It even has a library of machine learning algorithms, MLlib, including classification, regression, and clustering algorithms, to name a few. On the downside, consuming so much memory means Spark is computationally expensive. It also lacks a file management system, so it usually needs integration with other software, i.e. Hadoop.
6. SAS
Type of tool: Statistical software suite.
Availability: Commercial.
Mostly used for: Business intelligence, multivariate, and predictive analysis.
Pros: Easily accessible, business-focused, good user support.
Cons: High cost, poor graphical representation.
SAS (which stands for Statistical Analysis System) is a popular commercial suite of business intelligence and data analysis tools. It was developed by the SAS Institute in the 1960s and has evolved ever since. Its main use today is for profiling customers, reporting, data mining, and predictive modeling. Created for an enterprise market, the software is generally more robust, versatile, and easier for large organizations to use. This is because they tend to have varying levels of in-house programming expertise.
But as a commercial product, SAS comes with a hefty price tag. Nevertheless, with cost comes benefits; it regularly has new modules added, based on customer demand. Although it has fewer of these than say, Python libraries, they are highly focused. For instance, it offers modules for specific uses such as anti-money laundering and analytics for the Internet of Things.
Curious about a career in Data Analytics?
Start learning for free!
7. Microsoft Power BI
Type of tool: Business analytics suite.
Availability: Commercial software (with a free version available).
Mostly used for: Everything from data visualization to predictive analytics.
Pros: Great data connectivity, regular updates, good visualizations.
Cons: Clunky user interface, rigid formulas, data limits (in the free version).
At less than a decade old, Power BI is a relative newcomer to the market of data analytics tools. It began life as an Excel plug-in but was redeveloped in the early 2010s as a standalone suite of business data analysis tools. Power BI allows users to create interactive visual reports and dashboards, with a minimal learning curve. Its main selling point is its great data connectivity—it operates seamlessly with Excel (as you’d expect, being a Microsoft product) but also text files, SQL server, and cloud sources, like Google and Facebook analytics.
It also offers strong data visualization but has room for improvement in other areas. For example, it has quite a bulky user interface, rigid formulas, and the proprietary language (Data Analytics Expressions, or ‘DAX’) is not that user-friendly. It does offer several subscriptions though, including a free one. This is great if you want to get to grips with the tool, although the free version does have drawbacks—the main limitation being the low data limit (around 2GB).
8. Tableau
Type of tool: Data visualization tool.
Availability: Commercial.
Mostly used for: Creating data dashboards and worksheets.
Pros: Great visualizations, speed, interactivity, mobile support.
Cons: Poor version control, no data pre-processing.
If you’re looking to create interactive visualizations and dashboards without extensive coding expertise, Tableau is one of the best commercial data analysis tools available. The suite handles large amounts of data better than many other BI tools, and it is very simple to use. It has a visual drag and drop interface (another definite advantage over many other data analysis tools). However, because it has no scripting layer, there’s a limit to what Tableau can do. For instance, it’s not great for pre-processing data or building more complex calculations.
While it does contain functions for manipulating data, these aren’t great. As a rule, you’ll need to carry out scripting functions using Python or R before importing your data into Tableau. But its visualization is pretty top-notch, making it very popular despite its drawbacks. Furthermore, it’s mobile-ready. As a data analyst, mobility might not be your priority, but it’s nice to have if you want to dabble on the move! You can learn more about Tableau in this post.
9. KNIME
Type of tool: Data integration platform.
Availability: Open-source.
Mostly used for: Data mining and machine learning.
Pros: Open-source platform that is great for visually-driven programming.
Cons: Lacks scalability, and technical expertise is needed for some functions.
Last on our list is KNIME (Konstanz Information Miner), an open-source, cloud-based, data integration platform. It was developed in 2004 by software engineers at Konstanz University in Germany. Although first created for the pharmaceutical industry, KNIME’s strength in accruing data from numerous sources into a single system has driven its application in other areas. These include customer analysis, business intelligence, and machine learning.