Tutorials Forums
     Tutorials Videos
        Sign Up Now For FREE
Welcome, Guest
Username Password: Remember me

Self Cleansing Database
(1 viewing) (1) Guest
SQL SERVER HELP, SQL SERVER TUTORIALS, SQL SERVER PROGRAMMING, SQL SERVER CODE


SQL Server is a relational DMBS written for the Windows platform by Microsoft. SQL Server is a high-end and high-performance solution, for applications that interact with a database. Its use has been increasing because of the number of web applications that feature a data base back end.
  • Page:
  • 1

TOPIC: Self Cleansing Database

Self Cleansing Database 11 Feb 2010 13:28 #273

Introduction

This script shows how data in the table can be cleansed/enhanced using other records in the same table. It is essential that you have enough records in the database to get correct results.
In this example we will see how to fix values in Titles field of Contacts table if it also has a Gender field.

Using the code
With CREATE TABLE statement we find popular Forename-Gender combinations in our table and store them in Forename_Gender table.
--------------------------------------------------------------- Create table with all forenames grouped and counted	  ---------------------------------------------------------------DROP TABLE Forename_Gender;CREATE  Forename_GenderASSELECT Forenames,  Gender, COUNT(*) AS AmountFROM ContactsWHERE  	NOT Forenames LIKE '% %'	AND LENGTH(Forenames)>2	AND c.Gender<>'U' -- Gender is not unknownGROUP BY res.Forenames, res.Gender/

Then we need to remove records where same forename can be given to both boy and girl. I have used 1% threshold, i.e. if both male and female genders have more than 1% of records (amount) in the forename group, then this Forename-Gender combinations cannot be used and will be deleted from Forename_Gender table.
---------------------------------------------------------------	Delete records where forename can be both --		male and female-------------------------------------------------------------DELETE FROM forename_gender fWHERE EXISTS(SELECT * FROM forename_gender WHERE	Forenames=f.Forenames 	AND Gender<>f.Gender	AND AMOUNT*100>f.amount -- there is more than 1%	-- of people with different gender and same forename.	-- NOTE: you must change 100 to other number to use 	-- different threshold. I.e. 50 will mean that 2% of records	-- can have other Gender without being taken into account.)/


Then I delete unpopular forenames - less than 5 records in the table.
---------------------------------------------------------------	Need more than 5 records having same Gender-------------------------------------------------------------DELETE FROM forename_genderWHERE Amount<5/


And finally we can update Gender and Title fields with values from Forename_Gender.
---------------------------------------------------------------	UPDATE Contacts.Gender field where it is set to U-------------------------------------------------------------UPDATE Contacts CSET 	C.Gender=(SELECT Gender 		FROM Forename_Gender 		WHERE C.Forenames=Forenames)-- SELECT COUNT(*) FROM Contacts CWHERE 	C.Gender='U'	AND EXISTS (SELECT * 			FROM Forename_Gender 			WHERE C.Forenames=Forenames)/---------------------------------------------------------------	UPDATE Contacts.Title fields where it is not set--	or set to Mr/Ms-------------------------------------------------------------UPDATE Contacts CSET 	Title = (SELECT DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')		 FROM Forename_Gender 		 WHERE C.Forenames=Forenames)-- SELECT COUNT(*) FROM Contacts CWHERE 	UPPER(NVL(C.Title,'MR/MS'))='MR/MS'	AND CONTACT_TYPE='C'	AND EXISTS (SELECT * 			FROM Forename_Gender 			WHERE C.Forenames=Forenames					AND C.Gender=Gender)/
  • David
  • OFFLINE
  • CE Newbi e
  • Posts: 35
  • Karma: 21
CodersEngine

Re: Self Cleansing Database 06 Sep 2010 10:57 #387

The project is expected to solve 3 problems:

DDT in the Food Chain. One set of facts that is of great importance to ZipcodeZoo is the observation data that we will be acquiring from GBIF and using for our maps. A published map of Panthera Leo (African Lion) makes it clear that we should question the accuracy of information about the geographic distribution of species in the wild. GBIF created none of these errors – they came up the food chain from their providers. If it is true that up to 30% of published observation records are wrong, that’s a lot of DDT in the food chain. The ideal system would allow data cleaning to move through the provider network even more efficiently than errors move through that network. And the ideal system would allow corrections to happen at any time, from any point in the network.
Recurring Nightmares. In a partnership between a web site such as ZipcodeZoo or EOL and a data provider, regular transfers of data from one level of the system to the next are expected. ZipcodeZoo, for instance, will be importing 158 million GBIF records 6-12 times a year. A properly designed information tagging system could prevent errors from returning. Rather than manually clean an imported database of these returning errors, a site such as EOL or ZipcodeZoo could elect to not import or use any record with pending comments, or any record marked as “incorrect” by an appropriate party.
Communication Mayhem. In a data provider network, one provider gathers information from other providers, organizes it and aggregates it, and passes this on to yet another provider. GBIF receives data from a provider such as a museum, and passes it on to a provider who will display it on a web site or otherwise use it. At the end of this process is a fourth party, a user viewing a web site such as EOL.org or ZipcodeZoo.com. That user might spot an error, and report it to someone, as happened recently. A Peace Corps volunteer in Mexico wrote ZipcodeZoo with this message: “University of Kansas has 2 specimens of Phrynosoma, one braconnieri and one taurus, from Puebla Mexico listed as being from Namibia. I have sent them a message about it but your database has picked up that error.” So she sent 2 messages, and ZipcodeZoo should now communicate with her, with GBIF, and with someone at the University of Kansas. The communication problem could be solved by allowing the user to send a comment directly to the central database, with the comment automatically attached to the questionable data element. At intervals, participants would review their commented records, and mark (some or all) for removal. Such a record would be replaced, rather than edited.

______________________________________________________


Want to get-on Google's first page and loads of traffic to your website? Hire a SEO specialist from Ocean Groups seo specialist
  • chrisadam
  • OFFLINE
  • CE Newbie
  • Posts: 2
  • Karma: 0
  • Page:
  • 1
Time to create page: 0.70 seconds