API scraping project


Scraping of the API 'the-on-api' that contains datas about The Lord of the Rings trilogy

In [254]:
import requests
import json
import pandas
import matplotlib.pyplot as plt
import numpy as np
from pandas.io.json import json_normalize
pandas.set_option('display.max_rows', 500)
colors='YlGn_r'
colors2='YlGn'
colors3='dimgrey'
I chose to work on this API because this trilogy is one of my favourites. I was interested in studying the characters' data. The API displays every character that is named in the books. I will do some gender and body counting by race and gender.
In [255]:
url = 'https://the-one-api.herokuapp.com/v1/character'
headers = {"Authorization": "Bearer 16TtGwhAHbABjZtS6bgu"}
response = requests.get(url, headers=headers)
results = response.json()
LOTR_characters = json_normalize(results)
LOTR_characters = json_normalize(LOTR_characters.docs[0])
LOTR_characters.head()
Out[255]:
_id height race gender birth spouse death realm hair name wikiUrl
0 5cd99d4bde30eff6ebccfdf3 Human Male SA 192 Unnamed wife SA 603 Númenor Tar-Amandil http://lotr.wikia.com//wiki/Tar-Amandil
1 5cd99d4bde30eff6ebccfdf7 Human Male SA 1800 Unnamed wife SA 2221 Númenor Tar-Atanamir http://lotr.wikia.com//wiki/Tar-Atanamir
2 5cd99d4bde30eff6ebccfdf9 Human Male SA 1634 Unnamed wife SA 2035 Númenor Tar-Ciryatan http://lotr.wikia.com//wiki/Tar-Ciryatan
3 5cd99d4bde30eff6ebccfdfa Human Male SA 350 Unnamed wife SA 751 Númenor Tar-Elendil http://lotr.wikia.com//wiki/Tar-Elendil
4 5cd99d4bde30eff6ebccfdfc Human Male SA 3035 Unnamed wife SA 3255 Númenor Tar-Palantir http://lotr.wikia.com//wiki/Tar-Palantir
In [256]:
LOTR_characters.shape
Out[256]:
(933, 11)
There are 933 characters listed and 11 features describe them.
Let's examine one of our favorite character : Bilbo Baggins !
In [257]:
LOTR_characters[LOTR_characters.name=='Bilbo Baggins']
Out[257]:
_id height race gender birth spouse death realm hair name wikiUrl
478 5cd99d4bde30eff6ebccfc38 1.25m (4'1") Hobbit Male 22 September ,TA 2890 None Unknown (Last sighting 29 SeptemberTA 3021,) (... Brown, later white Bilbo Baggins http://lotr.wikia.com//wiki/Bilbo_Baggins

Gender count

First, some data cleaning.
In [258]:
LOTR_characters['gender']=LOTR_characters['gender'].replace('Males','Male')
LOTR_characters['gender']=LOTR_characters['gender'].replace('Most likely male','Male')
LOTR_characters['gender']=LOTR_characters['gender'].replace('male','Male')
LOTR_characters['gender']=LOTR_characters['gender'].replace(np.nan,'Unknown')
LOTR_characters['gender']=LOTR_characters['gender'].replace('','Unknown')
In [259]:
all_women = LOTR_characters[LOTR_characters.gender=='Female']['_id'].count()
all_women
Out[259]:
137
In [260]:
round(all_women/LOTR_characters.shape[0],2)*100
Out[260]:
15.0
Only 15.0 % of the characters named by J.R.R Tolkien are female !
In [261]:
all_men = LOTR_characters[LOTR_characters.gender=='Male']['_id'].count()
all_men
Out[261]:
638
In [262]:
round(all_men/LOTR_characters.shape[0],2)*100
Out[262]:
68.0
68.0 % of the characters are men (the rest have an unknown gender)
In [263]:
LOTR_characters['gender'].value_counts().plot.pie(autopct='%1.0f%%',pctdistance=0.5, labeldistance=1.2, cmap=colors, textprops={'color':colors3})
Out[263]:
<matplotlib.axes._subplots.AxesSubplot at 0x235851895c0>

Body count

Let's do some data cleaning first on the 'death' column. We can see that the cells are sometimes filled with 'still alive' so it distorts the body count.
In [264]:
set(LOTR_characters['death'].sample(n=100))
Out[264]:
{'',
 'Before ,FO 63',
 'Early ,Fourth Age',
 'Early ,Second Age',
 'FA 420',
 'FA 455',
 'FA 456',
 'FA 460',
 'FA 506',
 'FA 538',
 'FO',
 'FO 112',
 'FO 2',
 'FO 95',
 'Fate officially unknown',
 'February 293019',
 'First Age',
 'Fourth Age',
 'Immortal',
 'March 15 ,3019',
 'March 17 ,3019',
 'None',
 None,
 'Possibly end of ,First Age',
 'Presumably sometime after ,TA 3019',
 'SA 2386',
 'SA 2962',
 'SR 1236',
 'Still alive',
 'Still alive, departed over the sea in the early Fourth Age',
 'Still alive, departed to ,Aman ,FO 120',
 'TA 1029',
 'TA 1149',
 'TA 1272',
 'TA 1432',
 'TA 1621',
 'TA 1743',
 'TA 1856',
 'TA 2129',
 'TA 2327',
 'TA 2477',
 'TA 2489',
 'TA 2842',
 'TA 2872',
 'TA 2946',
 'TA 2946, (SR 1346',
 'TA 2977',
 'TA 2994',
 'TA 3000',
 'TA 3004'}
In [265]:
LOTR_characters['death']=LOTR_characters['death'].replace('None','')
LOTR_characters['death']=LOTR_characters['death'].str.replace(r'(^.*alive.*$)','', case=False)
LOTR_characters['death']=LOTR_characters['death'].replace('',np.nan)
Body count :
In [266]:
LOTR_characters.death.isnull().value_counts()
Out[266]:
False    573
True     360
Name: death, dtype: int64
Among all the characters present in the books/movies, 360 are still alive and 573 are dead.
In [267]:
round(LOTR_characters[LOTR_characters.death.isnull()].count()['_id']/LOTR_characters.shape[0],2)*100
Out[267]:
39.0
It means that only 39.0 % of the characters named by J.R.R Tolkien are still alive !
Let's now see how many women are among those 39%...
In [268]:
alive_women = LOTR_characters[LOTR_characters.gender=='Female']['death'].isna().sum()
alive_women
Out[268]:
65
In [269]:
round(alive_women/all_women,2)*100
Out[269]:
47.0
47.0 % of the female characters named by J.R.R Tolkien are stil alive. It's not a bad number considering the events depicted.
Let's compare this proportion to male characters.
In [270]:
alive_men = LOTR_characters[LOTR_characters.gender=='Male']['death'].isna().sum()
alive_men
Out[270]:
141
In [271]:
round(alive_men/all_men,2)*100
Out[271]:
22.0
In comparaison, only 22.0 % of the male characters named by J.R.R Tolkien survived !
In [272]:
LOTR_characters[LOTR_characters['death'].notnull()]['gender'].value_counts().plot.pie(autopct='%1.0f%%',pctdistance=0.5, labeldistance=1.2, cmap=colors, textprops={'color':colors3})
Out[272]:
<matplotlib.axes._subplots.AxesSubplot at 0x235851d1ba8>

Body count by race

First, some data cleaning !
In [273]:
LOTR_characters['race']=LOTR_characters['race'].replace('Hobbits','Hobbit')
LOTR_characters['race']=LOTR_characters['race'].replace('Ents','Ent')
LOTR_characters['race']=LOTR_characters['race'].replace('Men','Human')
LOTR_characters['race']=LOTR_characters['race'].replace('Orcs','Orc')
LOTR_characters['race']=LOTR_characters['race'].replace('Eagles','Eagle')
LOTR_characters['race']=LOTR_characters['race'].replace('Uruk-hai,Orc','Uruk-hai')
LOTR_characters['race']=LOTR_characters['race'].replace('Great Eagle','Eagle')
LOTR_characters['race']=LOTR_characters['race'].replace('Elves','Elf')
LOTR_characters['race']=LOTR_characters['race'].replace('Dwarves','Dwarf')
LOTR_characters['race']=LOTR_characters['race'].replace('Men,Wraith','Human')
LOTR_characters['race']=LOTR_characters['race'].replace('','Unknown')
LOTR_characters['race']=LOTR_characters['race'].replace(np.nan,'Unknown')
In [274]:
set(LOTR_characters['race'])
Out[274]:
{'Ainur',
 'Balrog',
 'Black Uruk',
 'Dragon',
 'Dragons',
 'Dwarf',
 'Eagle',
 'Elf',
 'Ent',
 'Goblin,Orc',
 'God',
 'Great Eagles',
 'Great Spiders',
 'Half-elven',
 'Hobbit',
 'Horse',
 'Human',
 'Maiar',
 'Maiar,Balrogs',
 'Orc',
 'Orc,Goblin',
 'Raven',
 'Stone-trolls',
 'Unknown',
 'Uruk-hai',
 'Urulóki',
 'Vampire',
 'Werewolves',
 'Wolfhound'}
In [275]:
LOTR_characters['race'].value_counts().count()
Out[275]:
29
Tolkien had a great deal of imagination, he depicted 29 different species !
If we label the races with the least characters as 'Other', we can vizualise the proportion of the principal races (Human, Elf, Hobbit, Dwarf and Orc).
In [278]:
principal_races=['Human','Elf','Hobbit','Dwarf','Orc']
LOTR_characters.loc[~LOTR_characters['race'].isin(principal_races), 'race'] = 'Other'
LOTR_characters['race'].value_counts().plot.pie(autopct='%1.0f%%',pctdistance=0.5, cmap=colors2, textprops={'color':colors3})
Out[278]:
<matplotlib.axes._subplots.AxesSubplot at 0x2358527fc50>
One final vizualsation of the remaining characters by races :
In [279]:
LOTR_characters[LOTR_characters['death'].notnull()]['race'].value_counts().plot.pie(autopct='%1.0f%%',pctdistance=0.5, cmap=colors2,textprops={'color':colors3})
Out[279]:
<matplotlib.axes._subplots.AxesSubplot at 0x235852cfe80>
And with the gender distinction :
In [280]:
gender = LOTR_characters.groupby(['race','gender'])['_id'].count().unstack('gender')
gender.plot(kind='bar', stacked=True, cmap=colors2)
Out[280]:
<matplotlib.axes._subplots.AxesSubplot at 0x235853392e8>
In conclusion, the Lord of the Ring trilogy definitely takes place in the age of Men ! Not only did the story depict the adventures of more male than female characters but it is also the human race that is the most represented.
It's also unfortunately logical that it's the human race that suffered the most deaths.