Data cleaning project

Cleaning data is obligatory for Data analysis.
Le's see what we can do with one of the most famous dataset : Shark attacks !

In [49]:
import pandas as pd
import re
import random

Import and quick look at the dataset

In [50]:
data = pd.read_csv('GSAF5.csv', encoding="latin-1")
data.head()
Out[50]:
Case Number Date Year Type Country Area Location Activity Name Sex ... Species Investigator or Source pdf href formula href Case Number.1 Case Number.2 original order Unnamed: 22 Unnamed: 23
0 2016.09.18.c 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing male M ... NaN Orlando Sentinel, 9/19/2016 2016.09.18.c-NSB.pdf http://sharkattackfile.net/spreadsheets/pdf_di... http://sharkattackfile.net/spreadsheets/pdf_di... 2016.09.18.c 2016.09.18.c 5993 NaN NaN
1 2016.09.18.b 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Chucky Luciano M ... NaN Orlando Sentinel, 9/19/2016 2016.09.18.b-Luciano.pdf http://sharkattackfile.net/spreadsheets/pdf_di... http://sharkattackfile.net/spreadsheets/pdf_di... 2016.09.18.b 2016.09.18.b 5992 NaN NaN
2 2016.09.18.a 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing male M ... NaN Orlando Sentinel, 9/19/2016 2016.09.18.a-NSB.pdf http://sharkattackfile.net/spreadsheets/pdf_di... http://sharkattackfile.net/spreadsheets/pdf_di... 2016.09.18.a 2016.09.18.a 5991 NaN NaN
3 2016.09.17 17-Sep-16 2016 Unprovoked AUSTRALIA Victoria Thirteenth Beach Surfing Rory Angiolella M ... NaN The Age, 9/18/2016 2016.09.17-Angiolella.pdf http://sharkattackfile.net/spreadsheets/pdf_di... http://sharkattackfile.net/spreadsheets/pdf_di... 2016.09.17 2016.09.17 5990 NaN NaN
4 2016.09.15 16-Sep-16 2016 Unprovoked AUSTRALIA Victoria Bells Beach Surfing male M ... 2 m shark The Age, 9/16/2016 2016.09.16-BellsBeach.pdf http://sharkattackfile.net/spreadsheets/pdf_di... http://sharkattackfile.net/spreadsheets/pdf_di... 2016.09.16 2016.09.15 5989 NaN NaN

5 rows × 24 columns

Shape and features
In [51]:
data.shape
Out[51]:
(5992, 24)
In [52]:
data.columns
Out[52]:
Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')
In [53]:
data.dtypes
Out[53]:
Case Number               object
Date                      object
Year                       int64
Type                      object
Country                   object
Area                      object
Location                  object
Activity                  object
Name                      object
Sex                       object
Age                       object
Injury                    object
Fatal (Y/N)               object
Time                      object
Species                   object
Investigator or Source    object
pdf                       object
href formula              object
href                      object
Case Number.1             object
Case Number.2             object
original order             int64
Unnamed: 22               object
Unnamed: 23               object
dtype: object
Except the "Year" and "original order" columns, all the others contain texts.

Columns cleaning

To prepare the Data cleaning, let's check how many empty cells there are in every column.
In [54]:
null_cols = data.isnull().sum()
null_cols[null_cols > 0]
Out[54]:
Country                     43
Area                       402
Location                   496
Activity                   527
Name                       200
Sex                        567
Age                       2681
Injury                      27
Fatal (Y/N)                 19
Time                      3213
Species                   2934
Investigator or Source      15
href formula                 1
href                         3
Unnamed: 22               5991
Unnamed: 23               5990
dtype: int64
The columns "Unnamed: 22" and "Unnamed: 23" are curious and mostly empty. Let's examine what they contain.
In [55]:
data['Unnamed: 22'].fillna(0, inplace=True)
[x for x in data['Unnamed: 22'] if x!=0]
Out[55]:
['stopped here']
In [56]:
data['Unnamed: 23'].fillna(0, inplace=True)
[y for y in data['Unnamed: 23'] if y!=0]
Out[56]:
['Teramo', 'change filename']
As all the non empty cells does not contain any important information, I delete them.
Next, I thought that the column "Case Number.2" could be a duplicate of "Case Number" and "Case Number .1"
In [57]:
data['Case Number.1'].equals(data['Case Number.2'].equals(data['Case Number']))
Out[57]:
False
Apparently it's not the case. Let's dig a little deeper and examine if "Case Number.1" is similar to "Case Number.2"...
In [58]:
data['Case Number.1'].equals(data['Case Number.2'])
Out[58]:
False
Let's dig a little deeper...
In [59]:
data['Case Number.1'].isin(data['Case Number.2']).value_counts()
Out[59]:
True     5982
False      10
Name: Case Number.1, dtype: int64
After checking how many values are different (10),I chose to delete this column. Moreover, it does not contain information I can use for deeper analysis.
I do the same process for "Case Number" and "Case Number.1".
In [60]:
data['Case Number'].equals(data['Case Number.1'])
Out[60]:
False
In [61]:
data['Case Number'].isin(data['Case Number.1']).value_counts()
Out[61]:
True     5979
False      13
Name: Case Number, dtype: int64
The result is the same as before, I will delete this column.
Finally, I examine the ressemblance of "href formula" and "href". Are they duplicate ?
In [62]:
data['href formula'].equals(data['href'])
Out[62]:
False
In [63]:
data['href formula'].isin(data['href']).value_counts()
Out[63]:
True     5939
False      53
Name: href formula, dtype: int64
There is less than 10% difference between the column, I choose to keep the column "href".
In [64]:
data = data.drop(["Unnamed: 22", "Unnamed: 23","Case Number.1","Case Number.2","Time",
                  "href formula",'pdf','original order'], axis=1)
data.head()
Out[64]:
Case Number Date Year Type Country Area Location Activity Name Sex Age Injury Fatal (Y/N) Species Investigator or Source href
0 2016.09.18.c 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing male M 16 Minor injury to thigh N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di...
1 2016.09.18.b 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Chucky Luciano M 36 Lacerations to hands N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di...
2 2016.09.18.a 18-Sep-16 2016 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing male M 43 Lacerations to lower leg N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di...
3 2016.09.17 17-Sep-16 2016 Unprovoked AUSTRALIA Victoria Thirteenth Beach Surfing Rory Angiolella M NaN Struck by fin on chest & leg N NaN The Age, 9/18/2016 http://sharkattackfile.net/spreadsheets/pdf_di...
4 2016.09.15 16-Sep-16 2016 Unprovoked AUSTRALIA Victoria Bells Beach Surfing male M NaN No injury: Knocked off board by shark N 2 m shark The Age, 9/16/2016 http://sharkattackfile.net/spreadsheets/pdf_di...

Data cleaning

"Date" column

I think the "Date" column is important for an analysis. Let's check what it contains..
In [65]:
set(data['Date'].sample(n=30))
Out[65]:
{' 01-Sep-2013',
 '01-Jan-71',
 '05-Apr-71',
 '05-Jul-03',
 '06-Sep-08',
 '07-Dec-59',
 '08-Aug-1899',
 '10-May-90',
 '11-Jun-02',
 '13-Dec-03',
 '13-Jun-49',
 '15-Jun-16',
 '16-Feb-61',
 '17-Jan-1837',
 '17-Mar-09',
 '20-Sep-04',
 '20-Sep-15',
 '23-Aug-64',
 '24-Aug-08',
 '26-Sep-59',
 '27-Jun-09',
 '27-Sep-10',
 '30-Jun-15',
 '30-Mar-36',
 'Apr-70',
 'Reported 09-Jan-1858',
 'Reported 12-Jan-1994',
 'Reported 15-Nov-1921',
 'Reported 16-Apr-1994',
 'Summer 1948'}
In [66]:
len(set(data['Date']))
Out[66]:
5128
Well, it seems that the column contains a lot of unique values but also in many different format.
After many tries, I found that cleaning this column cell by cell was too much of a job. Incidentally, I saw that the column "Case Number" had a format that contains the date recorded (except for the unknown dat, labelled ND). I use this column to create 3 columns with the date info (year, month, day). The column "Year1" will be useful to see if this process worked by comparing it with the existing column "Year".
In [67]:
try:
    data['Year_']=[d[0:4] for d in data['Case Number']]
except:
    data['Year_']='00'
try:
    data['Month']=[d[5:7] for d in data['Case Number']]
except:
    data['Month']='00'
try:
    data['Day']=[d[8:10] for d in data['Case Number']]
except:
    data['Day']='00'
data.loc[data['Year_'].str.contains('(?i)ND'),'Year_']='00'
data.loc[data['Year_'].str.contains('ND'),'Day']='00'
data.loc[data['Year_'].str.contains('ND'),'Month']='00'
data['Year_'] = data['Year_'].replace(['0.02','0.03','0.04','0.07'],0)
data['Year_'] = data['Year_'].astype(int)
data = data[['Case Number','Day','Month','Year_','Year','Date','Type','Country','Area','Location','Activity','Name',
           'Sex ','Age','Injury','Fatal (Y/N)','Species ','Investigator or Source','href']]
data.sample(n=10)
Out[67]:
Case Number Day Month Year_ Year Date Type Country Area Location Activity Name Sex Age Injury Fatal (Y/N) Species Investigator or Source href
4958 1924.02.08 08 02 1924 1924 08-Feb-24 Invalid AUSTRALIA Queensland Currumbin Bathing Frederick Dullroy M NaN Probable drowning & scavenging UNKNOWN NaN 11/02/1924 http://sharkattackfile.net/spreadsheets/pdf_di...
2918 1977.08.05 05 08 1977 1977 05-Aug-77 Invalid USA Florida St. Petersburg, Pinellas County Wading Michael Muradian M 17 Lacerations to hip and leg N Shark involvement not confirmed St. Petersburg Independent, 8/8/1977 http://sharkattackfile.net/spreadsheets/pdf_di...
1410 2004.10.06 06 10 2004 2004 06-Oct-04 Unprovoked REUNION Conservatória District P' tit Paris, Saint-Pierre Body boarding Vincent Motais M 15 Leg severely bitten, surgically amputated N Thought to involve a 2.5 m bull or tiger shark Federation Francaise de Surf http://sharkattackfile.net/spreadsheets/pdf_di...
1989 1997.11.09 09 11 1997 1997 09-Nov-97 Unprovoked AUSTRALIA Western Australia Albany Scuba diving (submerged riding a scooter) Kevin Hulkes M 42 Left arm lacerated when shark grabbed scooter N White shark Daily Telegraph, 11/12/1997, p.5 http://sharkattackfile.net/spreadsheets/pdf_di...
86 2016.03.02 02 03 2016 2016 02-Mar-16 Unprovoked BRAZIL Santa Catarina State Escalerio Beach Balneário Camboriú Swimming Rafael Hermes Thomas M 41 Minor injury to head N Sandtiger shark Misones Online, 3/4/2016 http://sharkattackfile.net/spreadsheets/pdf_di...
1087 2007.11.06 06 11 2007 2007 06-Nov-07 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Joseph Fox M 21 Cut to right knee N 4' to 5' shark S. Petersohn; GSAF' Central Florida News, 11/7... http://sharkattackfile.net/spreadsheets/pdf_di...
5312 1900.12.27 27 12 1900 1900 27-Dec-00 Unprovoked AUSTRALIA New South Wales Middle Harbour, Sydney Bathing Thomas Houstan M NaN FATAL Y NaN Taranaki Herald, 12/28/1900 http://sharkattackfile.net/spreadsheets/pdf_di...
74 2016.04.07.b 07 04 2016 2016 07-Apr-16 Unprovoked USA Florida Florida Keys, Monroe County Fishing Jonathan Lester M 34 Left hand bitten N 5' to 6' shark NaN http://sharkattackfile.net/spreadsheets/pdf_di...
2995 1975.07.05 05 07 1975 1975 05-Jul-75 Provoked AUSTRALIA Western Australia 15 km north of Lancelin, north of Perth Spearfishing Dennis Thompson M 29 Speared shark bit his arm between elbow and s... N 2.4 m [8'] whaler shark Washington Post, 7/7/1975 http://sharkattackfile.net/spreadsheets/pdf_di...
1093 2007.10.13 13 10 2007 2007 13-Oct-07 Unprovoked AUSTRALIA Queensland Holmes Reef Spearfishing Adam Wood M 31 Laceration to calf N Bronze whaler shark news.com.au, 10/14/2007 http://sharkattackfile.net/spreadsheets/pdf_di...
Let's see if the two "Year" columns match...
In [68]:
data[['Year','Year_']].sample(n=20)
Out[68]:
Year Year_
3285 1967 1967
4962 1923 1923
5697 1862 1862
1039 2008 2008
4118 1954 1954
1385 2005 2005
1611 2002 2002
4706 1934 1934
5896 0 0
5930 0 0
3464 1964 1964
3642 1961 1961
1765 2000 2000
441 2013 2013
885 2009 2009
906 2009 2009
1075 2008 2008
5393 1895 1895
4505 1941 1941
828 2010 2010
The two columns seems similar, but are they really ?
In [69]:
data['test'] = data['Year_']==(data['Year'])
data['test'].value_counts()
Out[69]:
True     5987
False       5
Name: test, dtype: int64
It seems that 5 cells are different. Let's examine them.
In [70]:
data.loc[data['Year_']!=(data['Year'])]
Out[70]:
Case Number Day Month Year_ Year Date Type Country Area Location Activity Name Sex Age Injury Fatal (Y/N) Species Investigator or Source href test
2449 1989.07.27 27 07 1989 1969 27-Jul-89 Invalid BERMUDA NaN NaN Scuba diving Russian male M 35 FATAL NaN Shark involvement suspected but not confirmed LA Times, 7/28/1989 http://sharkattackfile.net/spreadsheets/pdf_di... False
3662 1961.07.16 16 07 1961 1971 16-Jul-61 Unprovoked TURKEY Anatolia ?nciralti Beach, ?zmir Swimming ?brahim Karagöz M 16 Left leg injured N NaN C. Moore, GSAF http://sharkattackfile.net/spreadsheets/pdf_di... False
4312 1948.09.17.R 17 09 1948 1848 Reported 17-Sep-1848 Unprovoked TURKEY Adana Province Yumurtalik Swimming Ali Kaymaz M NaN FATAL Y NaN C. Moore, GSAF http://sharkattackfile.net/spreadsheets/pdf_di... False
4983 1923.00.00.a 00 00 1923 1922 1923 Provoked USA New Jersey Ocean City (offshore) Hoisting shark aboard fishing boat male M NaN Shark's tail broke his leg. PROVOKED INCIDENT N NaN Ref in New York Herald Tribune, 8/23/1960; V.M... http://sharkattackfile.net/spreadsheets/pdf_di... False
5043 1900.00.00.R 00 00 1900 1919 Reported to have taken place in 1919 Boating ITALY NaN Savona Fishing NaN M NaN No injury N 13' shark C. Moore, GSAF http://sharkattackfile.net/spreadsheets/pdf_di... False
After reading the web page associated, I retrieve the correct date and correct it in the "Year" column. Typos and mistakes are current in dataset. Data cleaning must always rely on every rows' informations.
In [71]:
data.loc[4983,'Year']=1923
data.loc[3662,'Year']=1961
data.loc[2449,'Year']=1989

The original "Date" column will be be more explicit if it's called "Date comment".
In [72]:
data.rename(columns={'Date': 'Date comment'}, inplace=True)

"Fatal" column

Let's now examine the "Fatal (Y/N)' column. It's important because it will inform us of the proportion of attack that were fatal.
In [73]:
set(data['Fatal (Y/N)'])
Out[73]:
{' N', '#VALUE!', 'F', 'N', 'N ', 'UNKNOWN', 'Y', 'n', nan}
Let's clean these values. As the value UNKNOWN is pertinent, I chose to keep it and rename the column ('Fatal (Y/N/U)')
In [74]:
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].str.strip()
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].fillna('U')
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].str.replace('n', 'N')
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].str.replace('#VALUE!', 'U')
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].str.replace('F', 'Y')
data['Fatal (Y/N)'] = data['Fatal (Y/N)'].str.replace('UNKNOWN', 'U')
data.rename(columns={'Fatal (Y/N)': 'Fatal (Y/N/U)'}, inplace=True)
print(set(data['Fatal (Y/N/U)']))
{'N', 'Y', 'U'}

A new column : "Number of victims"

I create a column "Number of victims" based on column "Name" because it is the only one that really gives the number of person attacked at each event.
In [75]:
data['Number of victims']=data['Name']
data['Number of victims'].fillna('Unknown', inplace=True)
data.head()
Out[75]:
Case Number Day Month Year_ Year Date comment Type Country Area Location ... Name Sex Age Injury Fatal (Y/N/U) Species Investigator or Source href test Number of victims
0 2016.09.18.c 18 09 2016 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County ... male M 16 Minor injury to thigh N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True male
1 2016.09.18.b 18 09 2016 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County ... Chucky Luciano M 36 Lacerations to hands N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True Chucky Luciano
2 2016.09.18.a 18 09 2016 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County ... male M 43 Lacerations to lower leg N NaN Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True male
3 2016.09.17 17 09 2016 2016 17-Sep-16 Unprovoked AUSTRALIA Victoria Thirteenth Beach ... Rory Angiolella M NaN Struck by fin on chest & leg N NaN The Age, 9/18/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True Rory Angiolella
4 2016.09.15 15 09 2016 2016 16-Sep-16 Unprovoked AUSTRALIA Victoria Bells Beach ... male M NaN No injury: Knocked off board by shark N 2 m shark The Age, 9/16/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True male

5 rows × 21 columns

Now, let's deduct the number of victim by the informations present in the column. I focused ont the cells that contained several names or an entire phrase.
In [76]:
unknown=['boat','All on board perished in the crash','a charter fishing boat with James Whiteside and his party','6 m skiboat, occupants: P.A. Reeder & crew','6 m boat, occupants John Winslet & customers','5.4 m boat','5 m skiboat',"""40' fishing cutter""","40' bonito boat","4 boats","""30' cabin cruiser owned by Stefano Catalani""","25-foot cutter","""25' cutter""","""25' rigid-hulled inflatable boat, HI-2""","""24' yacht owened by  C.L. Whitrow""","""22' pleasure boat""","""20' boat of Frank Stocker""",'2 USAF 4-engine planes (an HC-54 & a HC-97) each with 12 onboard collided in mid-air at low altitude and plunged into the Atlantic Ocean',"2 boats","""18' boat of Morris Vorenberg""","""16' launch""","""16' cabin cruiser with 35 hp outboard motor""","""15' boat""","""14' boat, occupant: Jonathan Leodorn""",'14-foot boat Sintra',"""12' dinghy""","""12' boat of Alf Laundy""",'10m boat Lucky Jim',"""10' dinghy""",'10.7 m boat. Occupants: John Capella & friends','Because of a mistaken belief that there were no survivors and several other successive errors, of the 100 to 150 men who survived the sinking, only 11 were rescued. Four of the Sullivan brothers died in the initial blast. ','boat','boat  Marie',"""Boat “Coca Cola”""","boat crew","boat Live N Hope","boat of Al Hattersly","""Boat of Captain Forman White,"boat of Scot's College rowers""","boat of Thomas Baker","boat of Wally Gibbons","Boat owned by Ricardo Laneiro","Boat with tourists onboard","boat x 2",'boat, occupants, P. Groenwald and others','boat, occupants: Jacob Kruger & crew','boat: Lady Charlotte, occupants: C. McSporran & his crew','Crew of Anti-submarine Squadron 23','crew','Haitian refugees perished when their boat capsized in choppy seas','males','males (wearing armor)','males, shark fishermen','mate & crew','multiple bathers','multiple boats including B.J. C. Brunt','No details','Occupants of skin boats','Passenger & crew','Passenger ferry Norman Atlantic','passenger in an automobile','pilot boat, occupants; Captain McAlister & crew','rowboat, ','rowboat, occupants: refugees fleeing Cuba','rowboats attacked by sharks','sailors','several Somali children','Severed human foot washed ashore (in sneaker with 2 Velcro closures)','ski-boat','skiboat','skiff, occupants: Russel J. Coles and others','slaves','surf patrol boat','Theodore Anderson’s captain & rest of crew taken by sharks']
zero_pers=['Jeff Wells claimed he rescued his "daughter" from a 4 m tiger shark']
one_pers=['After 2 days, Ann Dumas, 7,5 months pregnant, died of exposure & exhaution& her body was lashed to raft.','7.2 m boat. Occupant Kelvin Travers','5.75 m wooden boat, occupant: Yoshaiaki Ueda','5.4 m boat, occupant: Ivan Anderton','5 m skiboat; Stephanie','4.9 m fibreglass boat. Occupant: Jack Siverling',"""4.5 m boat, occupant: Rodney Lawn""","4.3 m skiff, occupant: Bob Shay","4-m runabout. Occupant: Allen Gade","""24' boat Shark Tagger Occupant Keith Poe""","22-ft boat.  Occupant Captain Scott Fitzgerald","""17' boat, occupant:  Richard Wade""","""16' skiff, occupant: W.A. Starck, II""","""16' Dreamcatcher. Occupant: Ian Bussus""","""16' boat, occupant: W. Lonergan""","""15-foot boat: occupant Woodrow Smith""","""14' catamaran, occupant: M. Leverenz""","""12' skiff, occupant: E.R.F. Johnson""","""12' boat Sio Ag, occupant: John Riding""",'12 m fishing boat. Occupant: Henry Tervo',"""10' rowboat, occupant: John Stephensen""",'British sailor from the  F-107','Arthur E. Taylor, a navy diver & member+G1053 of a 24-man demolition team',"""boat: 48' charter boat  GonFishin V""","""boat: 69.5' trawler Christy Nichole""",'British sailor from the  F-107','Robert W. McGhee, Private 1st Class, 8th Infantry','Sailor of tuna vessel No.12 Taiyo Marei','U.S. soldier in 161st Infantry Regiment, 25th Infantry Division','William Mills, a British solider, 36th Regiment']
two_pers=['a male & a female','a launch, occupants- Albert Cree & John Blacksall','8m inflatable boat. Occupants: Bhad Battle & Kevin Overmeyer','7 m boat, occupants: Tara Remington & Iain Rudkin','6 m boat: occupants  Stephen & Andrew Crust','5.4 m fibreglass boat, occupants: Robert & James Hogg','5 m skiboat Stephanie, occupants: Fanie Schoeman and Brigadier Bronkhorst','5 m skiboat Graanjan, occupants: Rudy van Graan, Jan de Waal Lombard','5 m boat, occupants: Don & Margaret Stubbs','5 m aluminum dinghy - occupants Mr. & Mrs. Paul Vickery',"""4.3 m skiff: occupants: James L. Randles, Jr. & James Myers""","""35' motor launch, occupants: Bill Fulham & T. Fanning""","""35' cruiser, Maluka II, occupants: Mr & Mrs E. Potts""","""3.5 -metre fibreglass boat, occupants: Harry Ulbrich and another fisherman""","""28' sea skiff, occupants: Alan Moree and another fisherman""","""22' boat, occupants: Saul White & Charles Dillione""","""20' boat; occupants: John Wright & a friend  ""","""19' clinker-built craft. Occupants: Ray Peckham & Mr. L.C. Wells""","""19' boat, occupants: Ray Peckham, L.C. Wells""","""18' launch, occupants: 2 fishermen""","""18' boat, occupants William & Leslie Newton""","""18' boat, occupants Richard Gunther & Donald Cavanaugh""","""18 hp Boston Whaler boat, occupant: G. W. Bane, Jr.""","""17' fishing launch, occupants: A. Burkitt & C. Brooke""","""17' fishing boat; occupants 2 men""","""16' motor launch owned by A. & E. Norton""","""14' open boat: occupants Richard Crew & Bob Thatcher""","""14' dinghy, 2 occupants""","""14' boat, occupants: 2 men""","""12' to 14' dory, occupants: John D. Burns & John MacLeod""","""12' ski, occupants: Bill Dyer & Cliff Burgess ""","""12' open motor boat, occupants Jack Platt & Peter Keyes""","""12' boat,           2 occupants""",'12-foot dinghy Occupants: R. Hunt & a friend.',"""11.6 m fibreglass boat. Occupants: Tony DeCriston & Dan Fink""","""10' skiff. Occupants F. Whitehead & L. Honeybone""","""10' row boat occupants;  Douglas Richards & George Irwin""",'boat, occupants:  Andrew Peterson & Peter Jergerson','boat, occupants:  Mike Taylor & his son, Jack, age 9','boat, occupants: Alf Dean, Jack Hood & Otto Bells','boat, occupants: Boyd Rutherford & Hamish Roper','boat, occupants: C. Nardelli & son','boat, occupants: John Griffiths & Thomas Johnson','boat, occupants: P.D. Neilly & Charlton Anderson','boat, Occupants: William Smith & Thomas Martin','boat; occupants: T & G Longhi','boat:  occupants: Nazzareno Zammit & Emmanuel',"""Boat: 14' Sunfish. Occupants Josh Long &  Troy Driscoll'""",'Boat: occupants: David Lock & his father','Boat: occupants: Tim Watson & Allan de Sylva','Bombardier J. Hall, Private Green of the Sherwood Foresters & Captain C. O. Jennings, R.E. Anti-tank Regiment','Brian Sierakowski & Barney Hanrahan','Bruce Flynn & his dive buddy','canoe, occupants: Chris Newman & Stewart Newman','canoe. Occupants: Doreen Tyrell & Frederick Bates','Captain Baxter & Dick Chares','Carlos Humberto Mendez & Esteban Robles','Colleen Chamberlin & Scott Chamberlin','Colonel B. & Sub-Lieutenant D.','Conway Plough &  Dr. Jonathan Higgs','crayfish boat. Occupants: Dave & Mitchell Duperouzel:','Curran See & Harry Lake','Dave Hamilton-Brown & Ant Rowan','dinghy, occupants: aborigine & lighthouse keeper','dinghy, occupants: Willem & Jan Groenwald','Dinghy. Occupants: Jeff Kurr and Andy Casagrande','Earl Yager & Riley McLachlan','Elena Hodgson & Isaac Ollis','Emil Uhlbrecht & unidentified person','Ensio Tiira & Fred Ericsson, deserters from the French Foreign Legion','Fishing boat. Occupants: Yunus Potur & Ali Durmaz','Gene Franken & Maurice McGregor','hobiecat, occupants: Judy Lambert  & a friend','Ida Parker & Kristen Orr','inflatable boat, occupants: Rudolf Bokelmann and Sakkie Vermeulen','inflatable dinghy, occupants: Ben Cropp, J. Harding & T. Fleischman','inflatable dinghy, occupants: Craig Ward & Gavin John Halse','"Inflatable kayak Occupants:  Andrej Kultan & Steve Hopkins."',"""inflatable rescue boat. Occupants: Lauren Johnson &. Kris O'Neill""",'J.T. Hales and Kenneth J. Hislop, Australian Navy frogmen','James C. Beason & Calvin E. Smith, Jr.  spent 16 hours in life raft','John Parker & Edward Matthews','Juan & Alex Bueno','Karl Pollerer & Eric Eisesenid','Lobster boat, occupants: Mr. P. Valentine & Mr. J. van Schalkwyk ','Louis Zamperini  & Russell Phillips','Mayabrit, an ocean rowing boat. Occupants: Andrew Barnett & J.C.S. Brendana','Monte Robinson & Andrew McNeill','Mr. Child & a Kanaka','Ned & Pawn','Nicaso Balanoba & Julian Dona','Occupants: Ivan Angjus & Stevo Kentera','occupants: John Chandler & Walter Winters','Occupants: Luke Jones & James Sequin','Occupants: Scott & John Fulton','Philip Case & William B. Gray','plywood dinghy, occupants: Jack Deegan & Trevor Millett','Pollione Perrini & Fioravante Perini','rowboat, occupants: Bob Scott & John Blackwell ','rowboat, occupants: James Mitchell-Hedges & Raymond McHenry','Salvatore & Agostino Bugeja','Sgt James Lacasse & Sgt David Milsten, USAF divers','skiboat, occupants: Danie & Fanie Schoeman','skiff with Dr. William T. Healey, Dr. Henry Callahan on board','skiff, occupants: J. & A. Ayerst','Teresea Britton (on raft) & a man (on floating debris)','Tony Moolman and another surfer','Two stowaways on German steamer Vela','Waade Madigan and Dr Seyong Kim','wooden boat, occupants: Jack Bullman & Keith Campbell']
three_pers=['Andong & 2 others','a skill. Occupants George Lunsford & 2 companions','6 m skiboat, occupants: Terry McManus, Dan Clark and Blackie Swart','6 m Seaduce - Occupants: Allen Roberts, Jason Savage & Rob Lindsay.','6 m boat Suki Saki, occupants: E.C. Landells & 2 friends','5.5 m fibreglass boat, occupants: Steven Piggott and Kelvin & Brendan Martin','5 m aluminum boat, occupants: Ben Turnbull, Lia & Neville Parker',"""4 m dinghy, occupants: Cecil Holmes, Chris Augustyn & Allen Varley ""","3.8-m boat with 3 people on board","""21' boat sank. Occupants: Max Butcher, George Hardy & Peter Thorne""","2.4 m rowboat, occupants: Edgar Brown, Jerry Welz & Cornelius  Stakenburg","""17' fishing boat. Occupants, Bubba DeMaurice, his wife & daughter""","""13' dinghy, occupant S. Smith, Leenee Dee & Marie Farmer   ""","""12' boat. Occupants:  Capt. E.J. Wines, Maj. W. Waller & Larry Waller""",'boat, occupants; Carl Sjoistrom & 2 other crew','boat, occupants:  Nels Jacobson & Franklin Harriman Covert','boat, occupants: Captains Charles Anderson, Emit Lindberg & Oscar Benson','Boat: occupants: Matt Mitchell & 2 other people','Burgess & 2 seamen','Captain Angus Brown, his son & brother','Captain Eric Hunt, the cook & a French passenger','catboat. Occupants: Captain Tuppe & 2 young ladiesr','Chris Haenga, Wayne Rangihuna  & Tamahau Tibble','D. R. Nelson, J. Greenberg & S.H. Gruber','Fishing boat  Bingo III , occupants: Michael Perkins, George Hornack & Capt. Lonergan ','"Fishing boat.   occupants: Laz Hüseyin, Ali Osman & Tursun "','Josh Francou, Michael Brister & Paul Bahr','launch, occupant Clive Whitrow, Dick Kuhne & Jim Pergola','Motor boat, occupants: Quintus Du Toit, J.H. van Heerden & J.P. Marais','Occupants: Andrew & Ben Donegan & Joel Ryan, ','Occupants: Jack Munro, Quinton Graham & Donald Shadler','open boat, occupants: Robert Ruark, Hoyle Dosher & Elmer Adkins','Rowgirls, an ocean rowing boat. Occupants: Sally Kettle, Claire Mills & Sue McMillan','Storm King; occupants - George Bridge & 2 sons','William Clinton Carter, Jr. & 2 other men']
four_pers=["""A 'tinnie". Occupants :Paul Sweeny, Paul Nieuwkerdk, John and Mark Kik """,'7 m skiboat Alrehmah III, occupants: Adolph Schlechter & 3 friends','6 m skiboat, occupants: Alex Mamacos, Noel Glintenkamp, Tony Mountifield & Dillon Alexandra','4.8 m skiboat, occupants: C. Cockroft & 3 men','16-foot launch, occupants: George Casey, Jack Byrnes, Julian Reynolds & Denny Laverty','Bry & David Mossman & 2 friends','Claude Hadley, William Grundy, Albert Faulkner & Frederick Faulkner','Fishing vessel. Occupants Gerry Malabago, Mark Anthony Malabago & 2 others','Inflatable dinghy, occupants: Jasmine Wigley, Greg Wilkie, Phil Rourke & Fleur Anderson','Mini Haa Haar, fiberglass boat, occupants: William Catton, Anthony Green, Tony & Kylie Barnes','rowboat, occupant: Joe Whitted, Christopher Quevedo & 2 Willard Brothers','skiboat, occupants: Gustav Boettger, Clive Mason, Keith Murrison & Sweis Olivier','William Olsen, William Peterson, Albert Thomas & R. Zekoski']
five_pers=['Albert Battles, James Dean & 4 crew','4.8-metre skiboat, Occupants: Rod Salm & 4 friends','boat of Dennis Kemp & 4 other occupants','boat Sea Hawk, occupants: R. Roberts & 4 others','boat, occupants:  Mr. Goslin & 4 passengers','Crew of aircraft: McGreevy, Beakley, Rosenthal, Ryan & Hodge','fishing boat, occupants: Simon Hlope & 4 other men','skiboat Double One, occupants: Anton & Michelle Gets, Ray Whitaker, John & Lyn Palmer','surf boat, occupants: Ray Sturdy and 4 other fishermen']
six_pers=['7 m fishing boat Metoo, occupants: Nicky & Paul Goles & 4 friends','6 m catamaran, occupants: Peter Robertson, Beauchamp Robertson, Gerald Spence & 3 crew','5 m inflatable boat, occupants: Kobus Potgieter & 5 friends',"""4.8 m  boat Peggy, occupants: John Oktober, L.A. van Zyl and 4 others""",'dinghy, occupants: T. Shipston, T. Whitta, L Cox, T. Jones, R. Genet & W. Pearce','motor boat, occupants: Mr. & Mrs. Sidney M. Colgate and their three children, Bayard, Caroline and Margaret ','Occupants: Hamza Humaid Al Sahra’a & 5 crew',"""Paul Timothy Lovette, Dr. Neal Beardsley, James C. Russell, Harold H. Mackie, Dale Howard & Diego J. Terres on 42' Navy craft, Marie"""]
seven_pers=['boat, occupants: Joseph Fitzback & 6 passengers','The June, occupants Bunny Pendelbury and crew of 6']
eight_pers=['Wes Wiggins and 7 others on the boat, Sparetime','7.5 m boat, occupants: 8 men','8 US airmen in the water, 1 was bitten by a shark']
for u in unknown:
    data['Number of victims']=data['Number of victims'].replace(u,'Unknown')
for one in one_pers:
    data['Number of victims']=data['Number of victims'].replace(one,1)
for two in two_pers:
    data['Number of victims']=data['Number of victims'].replace(two,2)
for three in three_pers:
    data['Number of victims']=data['Number of victims'].replace(three,3)
for four in four_pers:
    data['Number of victims']=data['Number of victims'].replace(four,4)
for five in five_pers:
    data['Number of victims']=data['Number of victims'].replace(five,5)
for six in six_pers:
    data['Number of victims']=data['Number of victims'].replace(six,6)
for seven in seven_pers:
    data['Number of victims']=data['Number of victims'].replace(seven,7)
for eight in eight_pers:
    data['Number of victims']=data['Number of victims'].replace(eight,8)
data['Number of victims']=data['Number of victims'].replace('boat Swift, occupants: Dolly Samuels & 8 other men',9)
data['Number of victims']=data['Number of victims'].replace('"21 m sportfishing vessel Seabiscuit. Occupants: Captain Louie Abbott & 2 dozen anglers"',25)
data['Number of victims']=data['Number of victims'].replace('"""15 Royal Canadian Airforce crew & 1 passenger"""',16)
data['Number of victims']=data['Number of victims'].replace('135 passengers & 13 crew',148)
data['Number of victims']=data['Number of victims'].replace('No survivors. 189 people were lost',295)
data['Number of victims']=data['Number of victims'].replace('Most were women & children',42)
data['Number of victims']=data['Number of victims'].replace('Mamerto Daanong, Tomas Inog & others',92)
data['Number of victims']=data['Number of victims'].replace('male + 20',21)
data['Number of victims']=data['Number of victims'].replace('At least 29 (and possibly another 71) Somali & Ethiopian refugees',90)
data['Number of victims']=data['Number of victims'].str.extract('(\d+)')
Now that the 'exceptions' are corrected, I fill the remaining cells by 1 because they contain the name of only 1 person.
In [77]:
data['Number of victims'].fillna(1, inplace=True)
data['Number of victims']=data['Number of victims'].astype(int)
set(data['Number of victims'])
Out[77]:
{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 20,
 21,
 23,
 28,
 31,
 40,
 48,
 65,
 68,
 69,
 130,
 200,
 206}

"Name" column

After creating this new column, I notice that in the "Name" column there is some informations that should be in the "Sex" column.
In [78]:
set(data['Name'].sample(n=40))
Out[78]:
{'9 people in the water',
 'Adam Johannes Kriel',
 'Alan Titchenal',
 'Alessandro Russo',
 'Andy Thompson',
 'Anthony Segrich',
 'Bob Purcell',
 'Bruce Bourke',
 'C.J. Wickersham',
 'Carson Jewell',
 'Chris Hogan',
 'Chris White',
 'Douglas Blackmore',
 'Douglas Kuchn, Jr.',
 'Fuentes Gonzalez Escualo ',
 'Hans Kretschmer',
 'Henry Usimewa',
 'Horace Hewison',
 'Huynh Nhu Hoang',
 'Jason Coffman',
 'John Davy & John Pierpoint',
 'Leo Streng',
 'Mary Marcus',
 'Michael Casey',
 'Nkosinathi Mayaba',
 'Occupant: Hasan Olta',
 'Percy Evensen',
 'Rich Reed',
 'Richard Peter Bisley',
 'Scott Johnson',
 'Simeon (Samuel) Ettelton',
 'Whitefield Rolle',
 'boy',
 'female',
 'launch, occupant Clive Whitrow, Dick Kuhne & Jim Pergola',
 'male',
 nan}
Let's check if these informations could help filling the empty cells of the "Sex" column.
First, there is something we really don't like in Data cleaning : a space in the column's name! Let's clean every columns' name to be sure.
In [79]:
data.rename(columns={'Sex ': 'Sex'}, inplace=True)
data.columns = data.columns.str.strip()
After that, let's clean the various spellings of sex present in the "Name" column and extract them in another column named "name_corrected" and fill the empty cells of "Sex"
In [80]:
def correct_name(row):
    if row["Name"] == "girl" or row["Name"] == "Girl":
        return "F"
    elif row["Name"] == "boy" or row["Name"] == "Boy":
        return "M"
    elif row["Name"] == "male" or row["Name"] == "Male":
        return "M"
    elif row["Name"] == "female" or row["Name"] == "Female":
        return "F"    
    else:
        return
    
data["name_corrected"] = data.apply(correct_name, axis=1)
data['Sex'].fillna(data.name_corrected, inplace=True)
Unfortunately, the data on the "name_corrected" column seems to contradict some of the "Sex" column... I choose to state that the "Name" column contains the true data and change the data on the "Sex" column accordingly.
In [81]:
def error_sex(row):
    if row['name_corrected']=='F':
        return 'F'
    elif row['name_corrected']=='M':
        return 'M'
    else:
        return row['Sex']
data['Sex']=data.apply(error_sex, axis=1)
Finally, let's check the data in the "Sex" column
In [82]:
set(data['Sex'])
Out[82]:
{'.', 'F', 'M', 'M ', 'N', None, 'lli'}
A little cleaning is necessary.
In [83]:
data['Sex'] = data['Sex'].str.replace('N', 'M')
data['Sex'] = data['Sex'].str.strip()
data['Sex'] = data['Sex'].fillna('Unknown')
data['Sex'] = data['Sex'].str.replace('lli', 'Unknown')
data['Sex'] = data['Sex'].str.replace('.', 'Unknown')
set(data['Sex'])
Out[83]:
{'F', 'M', 'Unknown'}
I think the sex informations have no place in the "Name" column, so I delete them by 'Unknown'
In [84]:
data['Name'].fillna('Unknown', inplace=True)
data.loc[data['Name'].str.contains('male',na=False),'Name']='Unknown'
data.loc[data['Name'].str.contains('boy',na=False),'Name']='Unknown'
data.loc[data['Name'].str.contains('female',na=False),'Name']='Unknown'
data['Name']=data['Name'].str.replace('Anonymous','Unknown')
data['Name']=data['Name'].str.replace('Male','Unknown')
data['Name']=data['Name'].str.replace('fisherman','Unknown')
data['Name']=data['Name'].str.replace('Female','Unknown')
data['Name']=data['Name'].str.replace('crewman','Unknown')
data['Name']=data['Name'].str.replace('a young Scotsman','Unknown')
data['Name']=data['Name'].str.replace('a native ','')
data['Name'] = data['Name'].replace('\s+', ' ', regex=True)

"Type" column

In [85]:
set(data['Type'])
Out[85]:
{'Boat', 'Boating', 'Invalid', 'Provoked', 'Sea Disaster', 'Unprovoked'}
Surprinsingly, the 'Type' column is almost clean !
In [86]:
data['Type'] = data['Type'].str.strip()
data['Type'] = data['Type'].str.replace('Boating','Boat')
data['Type'] = data['Type'].str.replace('Invalid','Unknown')
print(set(data['Type']))
{'Unknown', 'Boat', 'Sea Disaster', 'Unprovoked', 'Provoked'}

"Country' column

In [87]:
print(set(data['Country']))
{nan, 'PAPUA NEW GUINEA', 'NEW ZEALAND', 'TAIWAN', 'MEDITERRANEAN SEA', 'INDIAN OCEAN', 'JAVA', 'TUVALU', 'SAMOA', 'NEVIS', 'PERSIAN GULF', 'NICARAGUA ', 'GHANA', 'REUNION', 'RED SEA?', 'BELIZE', 'ANDAMAN / NICOBAR ISLANDAS', 'GUYANA', 'TANZANIA', 'BRITISH NEW GUINEA', 'INDIA', 'CAYMAN ISLANDS', 'St Helena', 'NIGERIA', 'CHILE', 'NORTH PACIFIC OCEAN', 'Between PORTUGAL & INDIA', 'INDIAN OCEAN?', 'SENEGAL', 'SAN DOMINGO', 'MEXICO', 'MALAYSIA', 'CROATIA', 'BANGLADESH', 'NETHERLANDS ANTILLES', 'VANUATU', 'BRAZIL', 'GRENADA', 'CAPE VERDE', 'INDONESIA', 'UNITED ARAB EMIRATES (UAE)', 'SRI LANKA', 'Coast of AFRICA', 'UNITED ARAB EMIRATES', 'MICRONESIA', 'EL SALVADOR', 'NEW BRITAIN', 'BAHREIN', 'VENEZUELA', 'FIJI', 'GABON', 'MOZAMBIQUE', 'ATLANTIC OCEAN', 'GEORGIA', 'CEYLON (SRI LANKA)', 'SAUDI ARABIA', 'GREENLAND', 'GUINEA', 'MARSHALL ISLANDS', 'NORTHERN ARABIAN SEA', 'ISRAEL', 'SOLOMON ISLANDS', 'SOUTH ATLANTIC OCEAN', 'SUDAN?', 'ARGENTINA', 'SOLOMON ISLANDS / VANUATU', 'BURMA', 'TASMAN SEA', 'PARAGUAY', 'AZORES', 'SOUTHWEST PACIFIC OCEAN', 'CUBA', 'ALGERIA', 'BERMUDA', 'HONDURAS', 'SWEDEN', 'UNITED KINGDOM', 'TRINIDAD & TOBAGO', 'FEDERATED STATES OF MICRONESIA', 'TONGA', 'NORTH ATLANTIC OCEAN', 'ICELAND', 'ECUADOR', 'KUWAIT', 'THAILAND', 'LEBANON', 'BAHAMAS', 'PALAU', 'PACIFIC OCEAN', 'FRENCH POLYNESIA', 'PUERTO RICO', 'ANTIGUA', 'MADAGASCAR', 'AMERICAN SAMOA', 'SOUTH AFRICA', 'SPAIN', 'EGYPT', 'MEDITERRANEAN SEA?', 'Sierra Leone', 'DOMINICAN REPUBLIC', 'MONACO', 'GUAM', 'RED SEA', 'DJIBOUTI', 'TURKS & CAICOS', 'OCEAN', 'EGYPT ', 'GUATEMALA', 'LIBYA', 'CANADA', 'BAY OF BENGAL', 'KOREA', 'COOK ISLANDS', 'GRAND CAYMAN', 'NEW GUINEA', 'PALESTINIAN TERRITORIES', 'ST. MARTIN', 'SOUTH CHINA SEA', 'MAYOTTE', 'TUNISIA', 'ST. MAARTIN', 'PACIFIC OCEAN ', 'THE BALKANS', 'NAMIBIA', 'HONG KONG', 'DIEGO GARCIA', 'SCOTLAND', 'PHILIPPINES', 'SOUTH PACIFIC OCEAN', 'SUDAN', ' PHILIPPINES', 'RUSSIA', 'JAMAICA', 'ARUBA', 'BRITISH ISLES', 'LIBERIA', 'ANGOLA', 'IRAN', 'CHINA', 'USA', 'CYPRUS', 'MAURITIUS', 'NICARAGUA', 'EGYPT / ISRAEL', 'FRANCE', 'ITALY', 'MARTINIQUE', ' TONGA', 'ADMIRALTY ISLANDS', 'SOMALIA', 'BRITISH WEST INDIES', 'CRETE', 'SEYCHELLES', 'SIERRA LEONE', 'ASIA?', 'MEXICO ', 'JAPAN', 'IRAQ', 'CENTRAL PACIFIC', 'Fiji', 'TURKEY', 'IRELAND', 'YEMEN', 'SYRIA', 'AUSTRALIA', 'NORTH ATLANTIC OCEAN ', 'OKINAWA', 'FALKLAND ISLANDS', 'MALDIVE ISLANDS', 'SLOVENIA', 'PORTUGAL', 'RED SEA / INDIAN OCEAN', 'SINGAPORE', 'VIETNAM', 'BARBADOS', 'NEW CALEDONIA', 'BRITISH VIRGIN ISLANDS', 'NORTHERN MARIANA ISLANDS', 'COSTA RICA', 'COLUMBIA', 'KIRIBATI', 'CARIBBEAN SEA', 'Seychelles', 'MID-PACIFC OCEAN', 'SOUTH KOREA', 'GREECE', 'CURACAO', 'YEMEN ', 'HAITI', 'GULF OF ADEN', 'MONTENEGRO', 'URUGUAY', 'JOHNSTON ISLAND', 'MALTA', 'EQUATORIAL GUINEA / CAMEROON', 'IRAN / IRAQ', 'ITALY / CROATIA', 'KENYA', 'NORWAY', 'WESTERN SAMOA', 'NORTH SEA', 'MID ATLANTIC OCEAN', 'PANAMA', 'ENGLAND'}
This column is also fairly clean. Let's just transform every lower case name to upper case.
In [88]:
data['Country'].fillna('Unknown', inplace=True)
data['Country'] = data['Country'].str.strip().str.upper()
print(set(data['Country']))
{'PAPUA NEW GUINEA', 'NEW ZEALAND', 'TAIWAN', 'MEDITERRANEAN SEA', 'UNKNOWN', 'INDIAN OCEAN', 'JAVA', 'TUVALU', 'SAMOA', 'NEVIS', 'PERSIAN GULF', 'GHANA', 'REUNION', 'RED SEA?', 'BELIZE', 'ANDAMAN / NICOBAR ISLANDAS', 'GUYANA', 'TANZANIA', 'BRITISH NEW GUINEA', 'INDIA', 'CAYMAN ISLANDS', 'NIGERIA', 'CHILE', 'NORTH PACIFIC OCEAN', 'INDIAN OCEAN?', 'SENEGAL', 'SAN DOMINGO', 'MEXICO', 'MALAYSIA', 'CROATIA', 'BANGLADESH', 'NETHERLANDS ANTILLES', 'ST HELENA', 'VANUATU', 'BRAZIL', 'COAST OF AFRICA', 'GRENADA', 'CAPE VERDE', 'INDONESIA', 'UNITED ARAB EMIRATES (UAE)', 'SRI LANKA', 'UNITED ARAB EMIRATES', 'MICRONESIA', 'EL SALVADOR', 'NEW BRITAIN', 'BAHREIN', 'VENEZUELA', 'FIJI', 'GABON', 'MOZAMBIQUE', 'ATLANTIC OCEAN', 'GEORGIA', 'CEYLON (SRI LANKA)', 'SAUDI ARABIA', 'GREENLAND', 'GUINEA', 'MARSHALL ISLANDS', 'NORTHERN ARABIAN SEA', 'ISRAEL', 'SOLOMON ISLANDS', 'SOUTH ATLANTIC OCEAN', 'SUDAN?', 'ARGENTINA', 'SOLOMON ISLANDS / VANUATU', 'BURMA', 'TASMAN SEA', 'PARAGUAY', 'AZORES', 'SOUTHWEST PACIFIC OCEAN', 'CUBA', 'ALGERIA', 'BERMUDA', 'HONDURAS', 'SWEDEN', 'UNITED KINGDOM', 'TRINIDAD & TOBAGO', 'FEDERATED STATES OF MICRONESIA', 'TONGA', 'NORTH ATLANTIC OCEAN', 'ICELAND', 'ECUADOR', 'KUWAIT', 'THAILAND', 'LEBANON', 'BAHAMAS', 'PALAU', 'PACIFIC OCEAN', 'FRENCH POLYNESIA', 'PUERTO RICO', 'ANTIGUA', 'MADAGASCAR', 'AMERICAN SAMOA', 'SOUTH AFRICA', 'SPAIN', 'EGYPT', 'MEDITERRANEAN SEA?', 'DOMINICAN REPUBLIC', 'MONACO', 'GUAM', 'RED SEA', 'DJIBOUTI', 'TURKS & CAICOS', 'OCEAN', 'GUATEMALA', 'LIBYA', 'CANADA', 'BAY OF BENGAL', 'KOREA', 'COOK ISLANDS', 'GRAND CAYMAN', 'NEW GUINEA', 'PALESTINIAN TERRITORIES', 'ST. MARTIN', 'SOUTH CHINA SEA', 'MAYOTTE', 'TUNISIA', 'ST. MAARTIN', 'THE BALKANS', 'NAMIBIA', 'HONG KONG', 'DIEGO GARCIA', 'SCOTLAND', 'PHILIPPINES', 'SOUTH PACIFIC OCEAN', 'SUDAN', 'RUSSIA', 'JAMAICA', 'ARUBA', 'BRITISH ISLES', 'LIBERIA', 'ANGOLA', 'IRAN', 'CHINA', 'USA', 'CYPRUS', 'MAURITIUS', 'NICARAGUA', 'EGYPT / ISRAEL', 'FRANCE', 'ITALY', 'MARTINIQUE', 'ADMIRALTY ISLANDS', 'SOMALIA', 'BRITISH WEST INDIES', 'CRETE', 'SEYCHELLES', 'SIERRA LEONE', 'ASIA?', 'IRAQ', 'JAPAN', 'CENTRAL PACIFIC', 'TURKEY', 'IRELAND', 'YEMEN', 'SYRIA', 'AUSTRALIA', 'OKINAWA', 'FALKLAND ISLANDS', 'MALDIVE ISLANDS', 'SLOVENIA', 'PORTUGAL', 'RED SEA / INDIAN OCEAN', 'SINGAPORE', 'VIETNAM', 'BARBADOS', 'NEW CALEDONIA', 'BRITISH VIRGIN ISLANDS', 'NORTHERN MARIANA ISLANDS', 'COSTA RICA', 'COLUMBIA', 'KIRIBATI', 'CARIBBEAN SEA', 'MID-PACIFC OCEAN', 'SOUTH KOREA', 'GREECE', 'CURACAO', 'HAITI', 'GULF OF ADEN', 'MONTENEGRO', 'URUGUAY', 'JOHNSTON ISLAND', 'MALTA', 'EQUATORIAL GUINEA / CAMEROON', 'BETWEEN PORTUGAL & INDIA', 'IRAN / IRAQ', 'ITALY / CROATIA', 'KENYA', 'NORWAY', 'WESTERN SAMOA', 'NORTH SEA', 'MID ATLANTIC OCEAN', 'PANAMA', 'ENGLAND'}

"Species" column

In [89]:
print(set(data['Species'].sample(n=20)))
{nan, '"grey-colored shark"', '40 to 50 sharks attacked survivors in the water', ' ', "4.6 m [15'] shark", 'Silvertip shark', '4.5 m white shark', 'White shark', "Bull shark, 1.8 m to 2.1 m [6' to 7'] ", "Mako shark (aka bonito shark) 1.2 m [4'] "}
This column needs a big cleaning but its informations are not pertinent for an analysis in my opinion. I decide to clean it just a little by filling the empty cells with 'Unknown'.
In [90]:
data['Species'].fillna('Unknown', inplace=True)
Finally, let's delete our test columns.
In [91]:
data.drop(['Year_','name_corrected'], axis=1, inplace=True)

Reordering and recording

The cleaning part is complete, let's now reorder the columns and save the dataset on a new file !
In [92]:
data.head()
Out[92]:
Case Number Day Month Year Date comment Type Country Area Location Activity Name Sex Age Injury Fatal (Y/N/U) Species Investigator or Source href test Number of victims
0 2016.09.18.c 18 09 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Unknown M 16 Minor injury to thigh N Unknown Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True 1
1 2016.09.18.b 18 09 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Chucky Luciano M 36 Lacerations to hands N Unknown Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True 1
2 2016.09.18.a 18 09 2016 18-Sep-16 Unprovoked USA Florida New Smyrna Beach, Volusia County Surfing Unknown M 43 Lacerations to lower leg N Unknown Orlando Sentinel, 9/19/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True 1
3 2016.09.17 17 09 2016 17-Sep-16 Unprovoked AUSTRALIA Victoria Thirteenth Beach Surfing Rory Angiolella M NaN Struck by fin on chest & leg N Unknown The Age, 9/18/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True 1
4 2016.09.15 15 09 2016 16-Sep-16 Unprovoked AUSTRALIA Victoria Bells Beach Surfing Unknown M NaN No injury: Knocked off board by shark N 2 m shark The Age, 9/16/2016 http://sharkattackfile.net/spreadsheets/pdf_di... True 1
In [93]:
data = data[['Case Number','Day','Month','Year','Date comment','Type','Country','Area','Location','Activity','Name','Sex','Age','Injury','Fatal (Y/N/U)','Number of victims','Species','Investigator or Source','href']]
In [94]:
data.to_csv('GSAF5_clean.csv', index=False)

The cleaning process is the most time consuming task of a Data analyst but it is essential. This dataset is now ready for analysis !