For a better reading experience, here is the key to the icons used for different objects on Acho:
🔀 : Variable | 🔘 : Button | 📋 : Table

How to combine dozens of big datasets together in a minute?

When doing data analysis, you rarely rely on a single dataset. You may have multiple datasets from third-party apps such as a CRM and a payment processor. To see how your CRM activity influences your revenue, you need to combine the CRM data with the payment processor's data. Likewise, merging tables that contain, for example, vendor invoices can help you analyze how much you will have to pay in the next couple of weeks.

If you only have two tables, you can paste one table below the other or use the “VLOOKUP” function to combine them in Excel. However, if you have 10 datasets, this can take a lot of time, especially when your data has hundreds of thousands of rows.

Generally, if your datasets are stored in a relational database, writing SQL queries to combine tables is more efficient. First, SQL supports various methods to merge tables. Second, you can combine tables directly while extracting data from the database. If you don't know how to write SQL queries, or your data is stored in flat files (such as .csv or .xlsx), another option is to use a cloud data warehouse platform, such as Acho Studio, to combine the datasets.

In this article, I will show you how to combine multiple datasets on Acho Studio.

Method 1: 🔘 Union

🔘 Union appends one table below another, so it adds more rows to the original table. As the following figure shows, 🔘 Union puts 📋 Table B below 📋 Table A.

The process of Union

Notice that 🔘 Union never changes the schema of the original table. To apply 🔘 Union, 📋 Table B should have the same number of columns as 📋 Table A, and each column in 📋 Table B should have the same data type as the corresponding column in 📋 Table A. For example, if the first column in 📋 Table A is a string, the first column in 📋 Table B should be a string as well.
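To make the column-count requirement concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in database, which is an assumption for illustration only; Acho's own engine may report mismatches differently. The table names table_a, table_b, and table_c and their columns are hypothetical. Note that SQLite is loosely typed, so this sketch only demonstrates the column-count rule, not the data-type rule.

```python
import sqlite3

# In-memory stand-in database (assumption: not Acho's actual engine).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (name TEXT, amount REAL);
    CREATE TABLE table_b (name TEXT, amount REAL);
    CREATE TABLE table_c (name TEXT);  -- one column short
    INSERT INTO table_a VALUES ('x', 1.0);
    INSERT INTO table_b VALUES ('y', 2.0);
    INSERT INTO table_c VALUES ('z');
""")

# Same number of columns: the rows of table_b are appended to table_a.
ok_rows = conn.execute(
    "SELECT name, amount FROM table_a "
    "UNION ALL "
    "SELECT name, amount FROM table_b"
).fetchall()

# Different number of columns: the database rejects the query.
try:
    conn.execute(
        "SELECT name, amount FROM table_a "
        "UNION ALL "
        "SELECT name FROM table_c"
    )
    mismatch_error = None
except sqlite3.OperationalError as exc:
    mismatch_error = str(exc)

print(sorted(ok_rows))
print(mismatch_error)
```

The first query succeeds because both sides return two columns; the second fails before returning any rows.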

How to union tables?

  1. Choose the table that you want to union.
  2. Specify the columns that match the original table. Make sure the columns you select have the same data types as the corresponding columns in the original table.

Suppose you have 5 tables, each recording the customer orders for a single day from May 8 to May 12.

Demo datasets - customer orders

To combine these tables, you can click the 🔘 SQL Editor action and paste the following code.

SELECT order_id, customer_id, postal_code, sales FROM orders_0508
UNION ALL
SELECT order_id, customer_id, postal_code, sales FROM orders_0509
UNION ALL
SELECT order_id, customer_id, postal_code, sales FROM orders_0510
UNION ALL
SELECT order_id, customer_id, postal_code, sales FROM orders_0511
UNION ALL
SELECT order_id, customer_id, postal_code, sales FROM orders_0512;
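With dozens of daily tables, typing such a query by hand becomes error-prone. One possible approach, sketched below, is to generate the query text from a list of table suffixes. The sketch runs against Python's built-in sqlite3 module with one made-up row per table; the sample values and the use of SQLite are assumptions for illustration, and the generated text would still need to be pasted into the SQL Editor.

```python
import sqlite3

days = ["0508", "0509", "0510", "0511", "0512"]
columns = "order_id, customer_id, postal_code, sales"

# Build one SELECT per daily table and chain them with UNION ALL.
query = "\nUNION ALL\n".join(
    f"SELECT {columns} FROM orders_{day}" for day in days
)

# Stand-in database with one made-up order per day (illustrative data only).
conn = sqlite3.connect(":memory:")
for i, day in enumerate(days):
    conn.execute(
        f"CREATE TABLE orders_{day} "
        "(order_id TEXT, customer_id TEXT, postal_code TEXT, sales REAL)"
    )
    conn.execute(
        f"INSERT INTO orders_{day} VALUES (?, ?, ?, ?)",
        (f"O{i}", f"C{i}", "10001", 10.0 * (i + 1)),
    )

combined = conn.execute(query).fetchall()
print(query)
print(len(combined))  # prints 5: one row from each of the five tables
```

Extending the days list is then enough to cover additional tables that follow the same naming pattern.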

If you don't know how to write SQL queries, you can use the 🔘 Union action instead. Choose the tables that you want to append and specify the column names that match the columns in the first table. If the column names are the same, the system fills them in automatically.

Acho Studio - Union

The 🔘 Union action will append orders_0509, orders_0510, orders_0511, and orders_0512 to the orders_0508 table.

* Notice that the 🔘 Union action removes duplicate rows automatically. That is, if two rows have exactly the same values in every column, only one of them is kept. In the SQL query above, by contrast, UNION ALL keeps duplicate rows; use UNION instead if you want them removed.
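The difference between keeping and removing duplicates can be checked with a small experiment. The sketch below uses standard SQL's UNION ALL (keeps duplicates) and UNION (removes them) on Python's built-in sqlite3 module; the two sample orders are made up, and SQLite is only a stand-in for whatever engine runs your queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_0508 (order_id TEXT, customer_id TEXT, postal_code TEXT, sales REAL);
    CREATE TABLE orders_0509 (order_id TEXT, customer_id TEXT, postal_code TEXT, sales REAL);
    INSERT INTO orders_0508 VALUES ('A1', 'C1', '10001', 20.0);
    INSERT INTO orders_0509 VALUES ('A1', 'C1', '10001', 20.0);  -- exact duplicate
    INSERT INTO orders_0509 VALUES ('A2', 'C2', '94105', 35.5);
""")

cols = "order_id, customer_id, postal_code, sales"

# UNION ALL keeps every row, including the repeated order A1.
keep_all = conn.execute(
    f"SELECT {cols} FROM orders_0508 UNION ALL SELECT {cols} FROM orders_0509"
).fetchall()

# UNION drops rows that are identical in every column.
deduped = conn.execute(
    f"SELECT {cols} FROM orders_0508 UNION SELECT {cols} FROM orders_0509"
).fetchall()

print(len(keep_all), len(deduped))  # prints: 3 2
```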

How to identify which table each row came from after applying Union?

In our case, it is hard to tell which date a row comes from, since these five tables have exactly the same schema and differ only in table name. To differentiate the rows, you can use the 🔘 Formula action to create a column called date in each table, and then apply 🔘 Union. This way, you know the date of each order.

Acho Studio - Formula

Moreover, if you don't want Union to remove duplicate rows automatically, this method also helps: once each table carries its own date value, identical orders from different tables are no longer exact duplicates, so all of them are kept.
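In SQL terms, the same idea can be sketched by adding a literal date column to each SELECT before the union. The example below is an illustration on Python's built-in sqlite3 module, not the Formula action itself; the labels '05-08' and '05-09' and the sample order are assumptions. It uses the deduplicating UNION on purpose, to show that the tagged rows survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_0508 (order_id TEXT, customer_id TEXT, postal_code TEXT, sales REAL);
    CREATE TABLE orders_0509 (order_id TEXT, customer_id TEXT, postal_code TEXT, sales REAL);
    -- The same order appears in both daily tables (made-up data).
    INSERT INTO orders_0508 VALUES ('A1', 'C1', '10001', 20.0);
    INSERT INTO orders_0509 VALUES ('A1', 'C1', '10001', 20.0);
""")

# Tag each row with the day of its source table, then union.
tagged = conn.execute("""
    SELECT '05-08' AS date, order_id, customer_id, postal_code, sales FROM orders_0508
    UNION
    SELECT '05-09' AS date, order_id, customer_id, postal_code, sales FROM orders_0509
""").fetchall()

for row in sorted(tagged):
    print(row)
```

Both rows remain in the result because they now differ in the date column, and each row shows which table it came from.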

When to use 🔘 Union?

  1. The datasets have the same columns with the same data types.
  2. You don't want to change the schema of the original table.
  3. You want to combine two or more tables.

Method 2: 🔘 Join

🔘 Join merges 📋 tables side by side based on 🔀 key columns. It compares the values of the 🔀 key columns in two tables and concatenates rows that have the same values in those columns. In other words, 🔘 Join adds more columns to the original table.

The process of Join

How to join 📋 tables?

  1. Choose the table that you want to combine.
  2. Select one or more 🔀 key columns. Notice that the 🔀 key columns in the two tables should have the same data type.
  3. Select the type of 🔘 Join: inner join, left join, or right join. An inner join keeps only rows whose keys match in both tables, a left join keeps every row of the first table, and a right join keeps every row of the second table.

Types of Join
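The three join types can be compared on a tiny example. This is a minimal sketch on Python's built-in sqlite3 module with hypothetical tables table_a and table_b; the right join is written as a left join with the tables swapped, because older SQLite versions do not support RIGHT JOIN directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, a_value TEXT);
    CREATE TABLE table_b (id INTEGER, b_value TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table_b VALUES (2, 'b2'), (3, 'b3');
""")

# Inner join: only id 2 exists in both tables.
inner = conn.execute("""
    SELECT a.id, a.a_value, b.b_value
    FROM table_a AS a INNER JOIN table_b AS b ON a.id = b.id
""").fetchall()

# Left join: every row of table_a, with NULL where table_b has no match.
left = conn.execute("""
    SELECT a.id, a.a_value, b.b_value
    FROM table_a AS a LEFT JOIN table_b AS b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# Right join, emulated by swapping the tables in a left join:
# every row of table_b, with NULL where table_a has no match.
right = conn.execute("""
    SELECT b.id, a.a_value, b.b_value
    FROM table_b AS b LEFT JOIN table_a AS a ON a.id = b.id
    ORDER BY b.id
""").fetchall()

print(inner)  # [(2, 'a2', 'b2')]
print(left)   # [(1, 'a1', None), (2, 'a2', 'b2')]
print(right)  # [(2, 'a2', 'b2'), (3, None, 'b3')]
```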

Here, we will use COVID-19 datasets to demonstrate how to combine datasets through Join.

Suppose you want to research how social distancing and mask wearing influence the infection rate of COVID-19. You do some research and find datasets containing the information that you need. Here are the tables:

📋 Table A: us_counties (tracks the number of COVID-19 cases and deaths by county and by date)

Table A - us_counties

📋 Table B: mask_use_by_county (records the frequency of mask-wearing by county)

Table B - mask_use_by_county

📋 Table C: google_mobility (from the Google Community Mobility Reports; tracks how people changed their visits to places like grocery stores and parks during COVID-19)

Table C - google_mobility

In this case, we want to combine these three tables. Here is our process.

The process to join COVID-19 tables

First, define the relationship between the 📋 us_counties table and the 📋 mask_use_by_county table. The two tables share a column called 🔀 county_fips_code, so you can combine them on this column. To keep all rows in the 📋 us_counties table, choose the left outer 🔘 Join. Next, combine 📋 us_counties and 📋 google_mobility based on 🔀 county_fips_code and 🔀 date.

In the 🔘 Join action, you can implement the above strategy like this.

Acho Studio - Join

Alternatively, you can paste the following code into the 🔘 SQL Editor:

SELECT *
FROM us_counties AS A
LEFT JOIN mask_use_by_county AS B
  ON A.county_fips_code = B.county_fips_code
LEFT JOIN google_mobility AS C
  ON A.county_fips_code = C.county_fips_code
  AND A.date = C.date
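To see what this chain of left joins returns, here is a small sketch on Python's built-in sqlite3 module. Only the table names and the county_fips_code and date columns come from the query above; the columns cases, always_share, and retail_change, as well as every data value, are made up for illustration and do not reflect the real datasets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schemas with made-up sample rows.
    CREATE TABLE us_counties (date TEXT, county_fips_code TEXT, cases INTEGER);
    CREATE TABLE mask_use_by_county (county_fips_code TEXT, always_share REAL);
    CREATE TABLE google_mobility (date TEXT, county_fips_code TEXT, retail_change REAL);

    INSERT INTO us_counties VALUES
        ('2020-07-01', '06037', 100),
        ('2020-07-02', '06037', 120),
        ('2020-07-01', '36061', 80);
    INSERT INTO mask_use_by_county VALUES ('06037', 0.8);
    INSERT INTO google_mobility VALUES ('2020-07-01', '06037', -30.0);
""")

rows = conn.execute("""
    SELECT A.date, A.county_fips_code, A.cases, B.always_share, C.retail_change
    FROM us_counties AS A
    LEFT JOIN mask_use_by_county AS B
      ON A.county_fips_code = B.county_fips_code
    LEFT JOIN google_mobility AS C
      ON A.county_fips_code = C.county_fips_code
      AND A.date = C.date
    ORDER BY A.county_fips_code, A.date
""").fetchall()

for row in rows:
    print(row)
```

All three us_counties rows are kept. The mask value repeats for every date of a county because that table is matched on the county code alone, while the mobility value appears only where both the county code and the date match; unmatched columns come back as NULL (None in Python).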

The Join action appends all columns from one table to another. If you don't need some of the columns, you can use the Hide Columns action to hide them.
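If you work in SQL rather than with the Hide Columns action, a comparable effect comes from listing the columns you want instead of using SELECT *. The sketch below, on Python's built-in sqlite3 module, compares the column names returned by both forms; the cases and always_share columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE us_counties (county_fips_code TEXT, cases INTEGER);
    CREATE TABLE mask_use_by_county (county_fips_code TEXT, always_share REAL);
""")

join_clause = """
    FROM us_counties AS A
    LEFT JOIN mask_use_by_county AS B
      ON A.county_fips_code = B.county_fips_code
"""

# SELECT * returns every column of both tables, so the key appears twice.
cursor = conn.execute("SELECT *" + join_clause)
all_columns = [col[0] for col in cursor.description]

# Listing columns explicitly keeps only what the analysis needs.
cursor = conn.execute(
    "SELECT A.county_fips_code, A.cases, B.always_share" + join_clause
)
chosen_columns = [col[0] for col in cursor.description]

print(all_columns)
print(chosen_columns)
```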

When to use 🔘 Join?

  1. The two datasets share only one or two columns, which can serve as the “🔀 key columns” that connect the 📋 tables.
  2. You don't want to lose any columns when combining datasets.

Set up Data Sync

After combining datasets, you can set up Data Sync on Acho to update the data automatically. When your database or data source receives new data, you don't need to combine the datasets again: every time you log into Acho, you get the combined dataset that you want. Data Sync saves you a lot of time on repetitive work and makes the analysis process more efficient.

Hope this article helps. Contact us in the chat box if you have any questions.