I have two CSV files that will be provided to me every month. Both files have the following fields: restaurant name, address, city and cuisine. The two files disagree in the restaurant name field. For example, the first CSV file has the restaurant name as "bel-air hotel", while the second has "hotel bel-air". However, the address and city fields are the same in both files: "701 stone canyon rd." and "bel air" respectively. I need to do the following:
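To make the mismatch concrete, here is a minimal sketch of pairing records on address and city instead of on name. It assumes the CSV headers are called name, address, city and cuisine, and that lowercasing and stripping punctuation is enough to make the address and city comparable. The helper names (normalize, match_key, load) and the cuisine value in the sample rows are placeholders of mine, not part of the real files.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def match_key(row):
    """Key built from address and city only, since the names disagree between files."""
    return (normalize(row["address"]), normalize(row["city"]))


def load(csv_text):
    """Read CSV text into a dict keyed by the normalized (address, city) pair."""
    return {match_key(r): r for r in csv.DictReader(io.StringIO(csv_text))}


# Hypothetical sample rows; header names and the cuisine value are placeholders.
FILE_A = "name,address,city,cuisine\nbel-air hotel,701 stone canyon rd.,bel air,cuisine_x\n"
FILE_B = "name,address,city,cuisine\nhotel bel-air,701 Stone Canyon Rd,Bel Air,cuisine_x\n"

a, b = load(FILE_A), load(FILE_B)

# Pairs of (name in file A, name in file B) that share the same address and city.
matches = [(a[k]["name"], b[k]["name"]) for k in a.keys() & b.keys()]
print(matches)
```

A key like this only works if the address and city really are reliable in both files; abbreviations such as "rd" versus "road" would need extra handling that this sketch does not attempt.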
1) Combine these two data sources and ensure the result is accurate. I was thinking of creating a primary key such as restaurant_id and having one table that holds the restaurant information: restaurant_id, name, address and city. A second table would hold restaurant_id as a foreign key, plus cuisine. Does this design make sense? If so, I was thinking of dumping the two files into a storage service such as Amazon S3, then writing a SQL script that copies the data from the S3 location into the database tables on Redshift, followed by scheduling an ETL update job on a monthly cadence. Alternatively, I could import the CSV files directly into an OLAP cube and do all the ETL cleaning, curation, transformation and validation there, but I'm not sure how I would perform the data cleaning in that case.
2) Resolve the discrepancy between the two data sources. Since I've created a separate table for the restaurant information, the discrepancy should cease to exist, but I would like to know your thoughts on this. Is there any other approach?
3) Frame excellent questions for the data owners/business owners. The questions would center on the 7Ws: who, what, when, where, why, how and how many. These seem to map to the dimension tables, but I have no idea what the fact table would be, as no business event has been mentioned to me. Can you please shed some light on this as well?
4) The result of this analysis would be used as a feedback loop for the data owners to correct their source data. Build a report listing all the columns and their definitions, so that data owners can fix the data errors in the source systems for continuous data improvement.
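For the report in item 4, a rough sketch of what a per-column summary could look like is below. The column definitions are placeholders I made up (the real ones would have to come from the data owners), the header names are assumed to be name, address, city and cuisine, and the profile function and sample rows are hypothetical.

```python
import csv
import io

# Hypothetical column definitions; the real ones should come from the data owners.
DEFINITIONS = {
    "name": "Restaurant name as recorded in the source system",
    "address": "Street address",
    "city": "City or neighbourhood",
    "cuisine": "Type of food served",
}


def profile(csv_text, source):
    """Return one report entry per column with row, blank and distinct counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = []
    for column, definition in DEFINITIONS.items():
        # Treat a missing field the same as an empty one.
        values = [(r.get(column) or "").strip() for r in rows]
        report.append({
            "source": source,
            "column": column,
            "definition": definition,
            "rows": len(values),
            "blank": sum(1 for v in values if not v),
            "distinct": len({v.lower() for v in values if v}),
        })
    return report


# Hypothetical sample: a duplicated row with the cuisine left empty.
SAMPLE = (
    "name,address,city,cuisine\n"
    "bel-air hotel,701 stone canyon rd.,bel air,\n"
    "bel-air hotel,701 stone canyon rd.,bel air,\n"
)

report = profile(SAMPLE, "file_1")
for entry in report:
    print(entry)
```

Blank counts and distinct counts per column are only a starting point; a list of the records whose names differ between the two files for the same address and city would probably be the most useful part of the feedback to the data owners.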