Friday 19 February 2021

Cleaning Data in Excel

I am working on project, where I have to load data in SQL and then create a data warehouse.

The data that I have got its in Excel files, I have to clean the data before loading in SQL server.

There can be many issues, like Extra Spaces, Duplicate Rows, Upper and Lower Case characters, Numbers are stored as character, Errors, etc


Lets start here with some tips


Remove Extra Spaces:

Extra Spaces are difficult to spot & correct. Multiple spaces may be easy, but trailing spaces are pretty tough. Trailing spaces are blank spaces at the end of the statement or word which are not followed by any other character.


Here’s an easy way to spot & eliminate such errors:


Syntax: TRIM(text)

Steps:

Consider data with four cells with different spacing errors.

Now select a column & type “TRIM(“

Now select the cell you want to correct (in matters of spaces).

The cell will be corrected. If there are other erroneous cells sequentially aligned, drag the fixed cell till the point, you want to check & correct.

This easy step can save you time!


Blank cells:

Blank cells are troublesome because they often create errors while creating reports. And, people usually want to replace such cells with 0, Not Available or something like that. But replacing each cell manually on a large data table would take hours. Luckily, there’s an easy way to tackle this problem.


Steps:

Select the entire Data (you want to treat)

Press F5 (on keyboard)

A dialogue box will appear > Select “Special”

Select “Blanks” & click “OK”

Now, all blank cells will be highlighted in pale grey color, out of which one cell would be white with a different border. That’s the active cell, type the statement you want to replace in blank cells.

Hit “Ctrl+Enter”


NOTE:

At the last step, if “Enter” only is pressed, then the value will be inserted only in the active cell. So remember to press “Ctrl+Enter.”


Convert Numbers Stored as Text into Numbers:

When we import data from files, other sources, databases, text, etc. During transit, data might get affected. Also, some have a habit of using an apostrophe before numerical values, which is considered as text in Excel. Such minor data conversion can drastically affect calculations.


Suppose there are three values “70, ’70, 80”. When we compare 70 and 80 (70<80), the result is “TRUE.” But when we compare “apostrophe 70 & 80” (‘70<80), the problem starts. Here the result will be FALSE as the text will be rated higher than any number. To eliminate such errors, here’s a trick.


Steps:

Select any blank cell & type 1

Select that cell & hit “Ctrl+C”

Now select your data set & go to Paste > Paste Special

In Paste Special, select “Multiply” option in the “Operation” category

Click “OK”


Here it multiples every single value to “1”. And anything multiplied by 1 is the same number. But this trick also takes care of the apostrophe numerical.


Remove Duplicates:

Elimination of duplicate data is necessary for the creation of unique data & less usage of storage. In duplication, you can either highlight it or delete it.


A) Highlight Duplicates:


Select the data & go to Home > Conditional Formatting > Highlight Cell Rules > Duplicate Values

A dialogue box will appear (Duplicate Values), Select Duplicate & formatting color

Press OK

All duplicate values will be highlighted!


B) Delete Duplicates:


Select the data & go to DATA > Remove Duplicates

A dialogue box will appear (Remove Duplicates), tick columns whose duplicates need to be found.

Remember to have a click on “My data has headers” (if your Data has headers) or else column heads will be considered as data & duplication search will be applied on it too.

Click OK!


Duplicate values will be removed! Suppose you select 4 of 4 columns. Then that four columns rows should also match or else; they won’t be considered duplicate.


Highlight Errors:

While creating reports or dashboards, you might face a few arithmetical errors (like divisional errors). Such errors are easy to spot if the Data is small. But for big data, it’s complicated. So to get rid of such mistakes, you can go for two ways: Conditional Formatting or Go to Special.


A) Using Conditional Formatting:

Select the Data

Go to Home > Conditional Formatting > New Rule

Within New Rule, Select “Format only cells that contain.”

In Rules, Select “Errors” & Click on “Format”

Select any color & click OK

Hit the final “OK” button

All the cells with errors are highlighted & now are easy to spot.


B) Go to Special:


Select the Data

Press F5

Click on “Special”

A dialogue box appears (Go to Special), Select Formulas

Now you get four options in Formulas, deselect all options except “Errors”

Click OK! Now all errors are selected, you can delete them manually or replace a statement.

If you wish to replace, then type the statement at active cell & hit “CTRL+ENTER.”


Change Text to Lower/Upper/Proper Case:

While importing data, we often find names in irregular forms like a lower, upper case, or sometimes mixed. Such errors are not easy to eliminate manually. Here’s a fingertip trick to bring back the consistency.


LOWER(text)

UPPER(text)

PROPER(text)


Steps:


Just type the formula you want to use, suppose “LOWER(“ and select the cell whose case needs to be changed.

Hit “CTRL+ENTER.”

The case has been changed & consistent

Drag down to do the same for other cells.

Similarly for UPPER() & PROPER()


Parse Data Using Text to Column:

Sometimes the received Data has texts filled in one cell, only separated by punctuations. Usually, the addresses are cramped in one cell separated by a comma. To distinguish values in separate cells, we can use “Text to Column.”

Steps:

Select the Data

Go to Data> Text to Column

A dialogue box will appear (Convert Text to Columns Wizard – Step 1 of 3), select Delimited or Fixed Width as per your convenience.

Delimited is to be selected if the width isn’t fixed, click “NEXT”

In Delimiters tick the option which separates your text in the cell. Suppose “Norwich Cathedral, Norwich, UK,” here three values are separated by commas. So we will select “Comma” for this example. And, deselect rest options.

View the preview & click on “NEXT”

Select Column Data Format & destination cell address

Click “FINISH”


Spell Check:

Spelling mistakes are common in text files & PowerPoint. However, MS points out such errors by underlining it with colorful dashes. And, MS Excel doesn’t have such feature. But you can use it below steps:



Select the Data

Press “F7”

A dialogue box appears, which shows you the possible wrong word & it’s the possible correct spelling. Click on “Change,” if you agree with the suggestion.

Check & change till it says “Spell check complete. You’re good to go!”


Delete all Formatting:

Suppose you want to clear all the formats, including highlights & borders. You can do this by selecting the data & go to HOME > Clear (in editing group) > Clear Formats. It will clear the formats & you get standard content without highlights or borders. Similarly, you can clear Content, Comments, Hyperlink, or entire data (using Clear All).



Use Find & Replace to Clean Data in Excel


A) Changing Cell References:


Press “CTRL+H” to open “Find and Replace”

Now in Replace > “Find What” (change the reference range too) “Replace With”

Suppose Find What: $B to Replace With: $C

Click on “Replace All”

Similarly finding & replacing using reference range we can clean the Data


B) Find & Change Specific Format:


Press “CTRL+H”

Select “Options”

Now go to “Format” of “Find What.” Here you can specify the format or choose a format from the cell. Suppose you select a format.

Now it will show you the preview for “Find What.”

Click on “Format” of “Replace With.” Suppose we go for “Format…”

Now select format, example: Number, Alignment, Font, Border, Fill, Protection.

Suppose we select Color then select any color to fill the column header cell.

Click on Replace All

Instantly the format has been changed!


C) Removal of Line Breaks:


Suppose we have a data where it is separated by line breaks (same cell but different rows). To remove these line breaks, follow the below steps:


Press “CTRL+H”

Find and Replace dialogue box will appear, press “CTRL+J”

Go to the replace with box & type a single space

Click Replace All

All rows will be managed in one row within the same cell!


D) Removal of Parenthesis:


Select the Data

Press “CTRL+H”

Type (*) in “Find What” (This will consider all characters within parenthesis)

Leave the Replace With column empty & click Replace

Parenthesis characters are removed!


No comments:

Post a Comment

MS SQL Server and its Editions.

Microsoft SQL Server  Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a ...