Thursday, March 28, 2013

[TALEND*] Combine Multiple Files Into One




Combine multiple files into one



This tutorial will outline the steps required to build a Talend job to append multiple csv files of the same format into a single combined master csv file.

Sample csv files used in this tutorial can be downloaded from these links:





Lets get started by assuming we are trying to load data from CSV files (like above) that contain the following fields:

• ItemId
• ItemName
• Qty
• DateofPurchase
• Price


The file would look something like this:




We want to load this file into a CSV file with a schema as follows:




Next we take the following  steps to build our Talend Open Studio solution:


Step 1: Open Talend
Open Talend and create or open an existing project



Step 2: Create a new job
Right click on Job Designs in the Repository window and select “Create job”
 
Name the job “item_load”






Step 3: Create a File Delimited repository element
Now we need to create a repository item for our CSV file. To do this click the arrow next to “Metadata” in the Repository window and right click on “File delimited” and select “Create file delimited”.





Enter a name for your example file schema and click next:






Select your example file that we saw earlier (or use your own) by clicking the “Browse” button.


Click Next







On the next screen select field separator as comma (as per your file data) and select the checkbox for “Set heading row as column names”, then click “Refresh Preview”. Click Next.






Talend has now generated an estimated schema, review this schema and make any changes as you would like, then click Finish.






Step 5:  Design your job
In this step we are going to design our job to connect our CSV file to our Master CSV file.

Open the job we created in Step 2 by double clicking the name of the job under Job Designs in the Repository window.

In the palette window on the right hand side type “fileinputDelimited” into the search box, then drag and drop the component “tFileInputDelimited” into the job window in the center of the screen.





Select the tFileInputDelimited Component in the job window and then select the “Component” tab near the bottom middle of the screen.

Click the drop down box next to “Property type” and select “Repository”
Then click the button with three dots that appears and select the delimited file from the repository that we created in Step 3.







Select a tMap component from the palette on the right hand side of the screen and drag it into the job design window

Right click on the tFileinputDelimited component, select Row->Main and connect a row to the tMap component.





Select a tFileoutputDelimited component from the Palette and drag it over to the job design window.






Click on the tFileOutputDelimited component to select it, then navigate to the Component tab in the lower middle of the screen

Under File Name, set the file location where you want to store the master file
 
Set the field separator to set your required delimiter.

Set append check box to append the data to existing file instead of creating new file every time.

Note: while running job for multiple files, All input files data will be appended to master file with this append check box

Check include header to write master file with header column.







Next right click the tMap component, Select “Row” then “New Output”.
Connect this new output to the tFileOutputDelimited component
Name the output row- “output1″

Now your tMap is connected to the tFileOutputDelimited component, double click the tMap component to open up the tMap editor.

Click and drag the data fields from the “row1″ panel to the left of the screen to the corresponding “output1″ data fields. This tMap editor enables you to map to fields from the CSV input to the correct output fields in Master CSV files.







Click “Apply” then “OK” to save the changes and return to the Job Design window

Now your job design is complete and we just need to run the job to load the file into Master CSV File




Step 6: Run your Job
In the group of tabs in the lower middle of the screen, select the “Run” tab





Under Execution, click the “Run” button

Your job will load and then run and you will see how many rows were processed in the job design window.





Congratulations! You have now loaded your one file into your Master File.


This job is great but it is only set up to process a specific CSV file which we designate. Our next step is to extend this job so that it will process multiple CSV files of the same format from a specified directory.




Step 7: Set up the File Directory for Iteration
Select the component tFileList from the Palette and drag it into the job design window

Click on the tFileList component and then click on the “Component” tab near the bottom center of the screen

Under “Directory”, select the directory that contains the excel files by clicking on the “…” button.







Step 8: Connect tFileList
Right click on the tFileList component, select Run -> Iterate
Connect this “Iterate” link to the tFileInputDelimited component






Step 9: Edit the tFileInputDelimited field
Select the tFileInputDelimited component and then navigate to the Component tab near the bottom middle of the screen

Change the Property type from “Repository” to “Built In”

Under “File name / Stream” delete the filepath string

Type in tfilelist then press cntrl + space

Select tFileList_1.CURRENT_FILEPATH







Step 10: Run the job
Congratulations! You are now ready to run your job.

Make sure you have multiple CSV files in your directory
Select the “Run” tab near the bottom middle of the screen and click the “Run” button

Now you will see all your files are processed one by one and loaded into the Master File






No comments:

Post a Comment