For a better reading experience, the following is the reference to different objects on Acho:
🔤 : Variable | 🔘 : Button | 📋 : Table
How to combine dozens of big datasets together in a minute?
When doing data analysis, rarely do you rely on one dataset. You may have multiple datasets from many third-party apps such as CRM and payment processors. To see how CRM influences your revenue, you will need to combine its data with a payment processor’s data. This way, you can receive a table containing for example a list of vendor invoices. Merging these datasets together can help you analyze how much you will have to pay in the next couple of weeks.
If you only have two tables, you can just paste a table to another one or apply the “vlookup” function to combine tables in Excel. However, if you have 10 datasets, it may take you a lot of time, not to mention that your data may have hundreds of thousands of rows.
Generally, if your datasets are stored in a relational database, writing SQL queries to combine tables is more efficient. First, SQL supports various methods to merge tables. Second, you can directly combine tables when extracting data from the database. If you don't know how to write SQL queries, or your data is stored in flat files (such as .csv or .xlsx), another option is to use a cloud data warehouse platform, such as Acho Studio, to combine the dataset.
In this article, I will show you how to combine multiple datasets on Acho Studio.
Method 1: 🔘 Union
🔘 Union is to append a table below another table, so it adds more rows to the original table. As the following graph, 🔘 Union is to put 📋 Table B below 📋 Table A.
Notice that 🔘 Union will never change the schema of the original table. To apply 🔘 Union, 📋 Table B should have the same number of columns as 📋 Table A. Also, each column in 📋 Table B should have the same data type with the corresponding column in 📋 Table A. For example, the first column in 📋 Table A is string and the corresponding column in 📋 Table B should be string as well.
How to union tables?
- Choose a table that you want to union
- Specify columns that match the original table. Be careful that the columns that you selected should have the same data type with the columns in the original table.
Suppose you have 5 tables. Each table records customer orders from May 8 to May 12 separately.
To combine these tables together, you can click the 🔘 SQL Editor action, and paste the following code.
If you don't know how to write SQL queries, you can just go on the 🔘 Union action. Then, choose the table that you want to append and specify column names that match columns in the first table. If the column names are the same, the system can fill it automatically.
The 🔘 Union action will append orders_0509, orders_0510, orders_0511, and orders_0512 to the orders_0508 table.
* Notice that the 🔘 Union action will remove duplicate rows automatically. That is, if two rows have exactly the same values in each column, the program will only keep one of them.
How to identify which table from which table after applying Union?
In our case, it is hard to distinguish which row from which date since these five tables have exactly the same schema. The only difference is the table name. To differentiate these rows, you can use the 🔘 Formula action to create a column called date for each table, and then apply 🔘 Union. Thus, you can know the date of each order.
Moreover, if you don't want to remove duplicate rows automatically after applying Union, you can use this method to keep all repeated rows.
When to use 🔘 Union?
- Datasets must include the same columns with the same data type.
- You don't want to change the schema of the original table.
- You may want to combine 2 or more tables.
Method 2: 🔘 Join
🔘 Join is to merge 📋 tables side by side based on the 🔤 key columns. This method compares values of the key columns in two tables. Rows with the same value in the 🔤 key columns will be concatenated together. Basically, 🔘 Join add more columns to the original table
How to join 📋 tables?
- Choose a table that you want to combine.
- Select one or more 🔤 key columns. Notice that the 🔤 key columns in two tables should have the same data type.
- Select the type of 🔘 join: inner join, left join or right join. To see the difference between three joins, please visit here.
Here, we will use COVID-19 datasets to demonstrate how to combine datasets through Join.
Suppose you want to research how social-distancing and mask-wearing influence the infection rate of COVID-19. You do some research and find some datasets containing information that you need. Here are the tables:
📋 Table A: us_counties (tracks the number of COVID-19 cases and deaths by county and by date)
📋 Table B: mask_use_by_county (records the frequency of mask-wearing by county)
📋 Table C: google_community (This dataset is from Google community mobility report and it tracks how people change their visits to places like grocery stores and parks during COVID-19)
In this case, we want to combine these three tables. Here is our process.
First is to define the relationship between the 📋 us_counties table and the 📋 mask_use_by_county table. Since two tables have a common column called 🔤 county_fips_code, and thus you can combine data based on this column. Moreover, to keep all rows in the 📋 us_counties table, you can choose the left outer method to 🔘 join. Next, 📋 combine us_counties and 📋 google_mobility based on 🔤 county_fips_code and 🔤 date .
In the 🔘 Join action, you can implement the above strategy like this.
Or, You can also paste the following code in the 🔘 SQL editor:
The Join action is to append all columns from a table to another table. If you don't want some of the columns, you can just go on the Hide Columns action to hide them.
When to use 🔘 Join?
- Two datasets only have one or two same columns that can be used as “🔤 key columns” that connect 📋 tables.
- You don't want to lose any columns when combining datasets.
Set up Data Sync
After combining datasets, you can set up Data Sync on Acho to update data automatically. When your database or data source has new data, you don't need to combine these datasets every time. Every time you log into Acho, you can acquire a dataset that you want. Data Sync can help you save a lot of time on repeating works and improve efficiency of the analysis process.
Hope this article helps. Contact us in the chat box if you have any questions.