How to Organize and Wrangle Data

A data analyst in Washington D.C. often begins each project by organizing data. Well-organized data makes it far easier to pull relevant information out of it. In most organizations there are multiple sources of data that need to be brought together to provide a complete view of your processes.

The process of combining data from multiple sources into a single repository is referred to as data integration. For example, an organization that sells products online needs to organize data on sales, store inventory, customer returns, supplier orders, and revenue for each product. Through data integration, all of this data is combined and segmented to provide valuable insights.
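To make this concrete, here is a minimal sketch of that kind of integration in Python with pandas. The sales and returns tables, their columns, and their values are all hypothetical, invented for illustration; substitute the fields your own sources actually use.

```python
import pandas as pd

# Hypothetical data from two sources: a sales system and a returns system.
sales = pd.DataFrame({
    "product_id": [101, 102, 103],
    "units_sold": [250, 180, 95],
    "revenue": [2500.0, 3600.0, 950.0],
})
returns = pd.DataFrame({
    "product_id": [101, 103],
    "units_returned": [12, 4],
})

# Combine the two sources on their shared key; products with no
# returns get a count of zero instead of a missing value.
combined = sales.merge(returns, on="product_id", how="left")
combined["units_returned"] = combined["units_returned"].fillna(0).astype(int)
print(combined)
```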

Understanding Data Integration

Data integration is one of the most important pillars of data organization. Through data integration you can remove duplicates from your data, correct errors in it, apply transformations according to business rules, and store the result. When you have very small volumes of data, you can often move it manually from different sources into a common store. With large volumes of data, however, manual integration is not viable, so you need to rely on a dedicated data integration tool.
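As a minimal sketch of what those cleaning steps can look like in practice, here is an example in Python with pandas; the order data, its columns, and the business rule are invented for illustration.

```python
import pandas as pd

# Hypothetical raw order data with an exact duplicate row and an
# inconsistently capitalized status value.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "status":   ["shipped", "shipped", "SHIPPED", "returned"],
    "amount":   [20.0, 20.0, 35.0, 15.0],
})

clean = (
    orders
    .drop_duplicates()                                  # remove duplicate records
    .assign(status=lambda d: d["status"].str.lower())   # correct inconsistent values
)

# Apply a business rule: returned orders do not count toward net revenue.
clean["net_amount"] = clean["amount"].where(clean["status"] != "returned", 0.0)
print(clean)
```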

Data organization requires a deliberate strategy and a designed approach that can cope with the speed at which data arrives, its varied structures, and its volume. Data is often categorized as structured, semi-structured, or unstructured. Structured data is neatly organized into rows and columns. Semi-structured data carries some organizational markers, such as tags or key-value pairs in JSON or XML, but does not fit a fixed tabular schema. Unstructured data, such as free text or images, has no notion of rows and columns at all.
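A small illustration of the difference, using made-up records: the structured record fits fixed columns, while the semi-structured one carries its own tags and can vary in shape from record to record.

```python
import json

# Structured: every record has the same fixed fields (think rows and columns).
structured_row = ("C-1001", "Alice", "Washington")

# Semi-structured (JSON here): fields are tagged, but the shape can vary --
# this record nests an address, while another record might omit it entirely.
semi_structured = json.loads("""
{
  "customer_id": "C-1001",
  "name": "Alice",
  "address": {"city": "Washington", "state": "DC"}
}
""")
print(semi_structured["address"]["city"])
```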

Data volume often ranges from a few kilobytes to terabytes. Data can also arrive at intervals of milliseconds, minutes, hours, days, or even weeks. Depending on your data's volume, structure, and arrival frequency, you need to select an appropriate tool.

Each organization has different data management challenges that cannot be solved with a single tool. Identifying the right tool or tools is a prerequisite for gaining value from your data. In this post we are going to look at three tools that you can use to organize and analyze your data: Excel, databases, and Hadoop. We will discuss situations in which each is appropriate.

Excel

Excel is widely used for data organization and analysis because it is readily available and very user friendly. It is an excellent tool for organizing structured data at volumes that fit within your system's memory. With Excel you can easily perform calculations using formulas, built-in functions, or custom functions. With Power Query you can extract data from different sources into Excel with little coding required. Excel also lets you import data from the web and from databases such as SQL Server and MySQL. Importing and integrating data from .csv, .xml, and text files, from Azure, and from Excel tables and other data sources is also straightforward.

Once your data is in Excel, you can begin analyzing it with compatible tools in the Excel family, known as add-ins. For example, with Power Pivot you can analyze your data using pivot tables and pivot charts, and with Power View you can visualize it using charts and maps.

Databases

Excel helps you organize and analyze structured data that fits within your system memory. When your data cannot fit in memory, or is not structured, databases are the right tool. Databases can handle gigabytes or even terabytes of data. They are broadly categorized as relational (SQL) databases and NoSQL databases. Examples of relational databases are Oracle, SQL Server, MySQL, IBM Db2, and PostgreSQL. Examples of NoSQL databases are MongoDB, Cassandra, and HBase.

Relational databases are suitable for organizing large volumes of structured data. They are queried and manipulated with Structured Query Language (SQL). With relational databases you can import data from other relational databases, from business applications such as CRM systems, and from flat files such as .csv and text files, as well as from mainframe databases and legacy applications.
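As a quick illustration of manipulating relational data with SQL, here is a sketch using Python's built-in sqlite3 module; the table, columns, and values are made up for the example.

```python
import sqlite3

# An in-memory database stands in for a real relational database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(101, "East", 2500.0), (102, "West", 3600.0), (101, "West", 1200.0)],
)

# SQL lets you filter, join, and aggregate the gathered data declaratively.
for row in conn.execute(
    "SELECT product_id, SUM(revenue) FROM sales GROUP BY product_id"
):
    print(row)
conn.close()
```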

To integrate data from different sources into your relational database, you can use SQL or rely on a dedicated data integration tool. Data integration tools help your data analysts source, clean, and load data into the database. Once your data has been cleaned and stored in your relational database, you can use business intelligence applications for data analysis and visualization. Examples of business intelligence applications are IBM Cognos, Tableau, and QlikView.
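Here is a minimal sketch of the load step of such a pipeline, assuming the data has already been cleaned with pandas; the database file, table name, and contents are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned output from an earlier integration step.
clean = pd.DataFrame({
    "product_id": [101, 102],
    "revenue": [3700.0, 3600.0],
})

conn = sqlite3.connect("warehouse.db")
# pandas can write a DataFrame straight into a SQL table, replacing any
# previous load; a BI tool can then query the table directly.
clean.to_sql("product_revenue", conn, if_exists="replace", index=False)
conn.close()
```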

NoSQL databases, by contrast, are suitable for organizing very large volumes of semi-structured or unstructured data distributed across multiple servers. They are an excellent choice when relational databases cannot handle the frequency, volume, and variety of the data that needs to be organized and analyzed.
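As a sketch of the document model that many NoSQL databases use, here is an example with pymongo; it assumes a MongoDB server is running locally on the default port, and the database, collection, and field names are invented.

```python
from pymongo import MongoClient

# Assumes a local MongoDB server on the default port 27017.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Unlike relational rows, each document can have its own shape.
products.insert_one({"sku": "A-100", "name": "Lamp", "tags": ["home", "light"]})
products.insert_one({"sku": "B-200", "name": "Cable", "length_m": 2})

print(products.find_one({"sku": "A-100"}))
client.close()
```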

Hadoop

Hadoop is an ecosystem of data management tools developed by Apache to handle the growth in data that existing tools could not: its volume, variety, and velocity. Data of such magnitude is often too large to host on a single server, so Hadoop was designed to process data stored across hundreds or even thousands of servers, and it has been a success.

The software is designed to handle very large amounts of data, whether structured or unstructured, and whether it streams in continuously or arrives in batches. It has a specialized file system for storing data, the Hadoop Distributed File System (HDFS).

Hadoop includes tools for importing data into HDFS and exporting it back out. These data movement tools let your data analysts pull data from different sources such as databases, social media, the web, and files. Companion tools such as Sqoop and Pig help you source data and move it into Hadoop: Sqoop specializes in transferring data between relational databases and Hadoop, while Pig lets you load and process both structured and unstructured data. Once your data is in Hadoop, you can analyze it by writing MapReduce programs, traditionally in Java.
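Hadoop's native MapReduce API is Java, but the map-and-reduce pattern itself is easy to sketch in Python; scripts in this style can also run on a cluster through the Hadoop Streaming interface, which pipes records over standard input and output. The word-count example below chains the two steps locally.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each word, grouping sorted pairs by key."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Read text from stdin and print one "word<TAB>count" line per word.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```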

As data analysis continues to evolve, more and more tools are being introduced to support gathering, structuring, and presenting data. For example, because writing MapReduce code is complex, several tools have been developed to simplify analysis on the platform. Hive lets you build data warehouses and query data with a SQL-like language. Spark lets you develop machine learning algorithms to discover patterns in your data. HBase manages structured and semi-structured data within Hadoop. These are some of the tools that simplify organizing and analyzing your data once it is captured in Hadoop.
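For instance, here is a brief sketch of SQL-style analysis with PySpark; it assumes the pyspark package is installed, and the sales data and view name are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Hypothetical sales records; in practice this might be read from HDFS.
df = spark.createDataFrame(
    [(101, "East", 2500.0), (102, "West", 3600.0), (101, "West", 1200.0)],
    ["product_id", "region", "revenue"],
)

# Register the data as a temporary view and query it with SQL, much as
# you would a Hive table.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT product_id, SUM(revenue) AS total FROM sales GROUP BY product_id"
).show()
spark.stop()
```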

Selecting the right tool to organize and analyze your data is essential to understanding it. With the right tool you can reap the benefits of the insights within your data, helping you reduce waste, improve customer retention, and develop reliable business strategies. Armed with knowledge from your data, you can make more informed brand decisions.