Space Launches Dataset Exploration

The number of space launches and satellites in orbit has increased drastically in the last few years, and we would like to see how marked that trend is compared to previous decades.

We’ll be exploring launch data from the Gunter Space Page, that Gunter kindly authorized me for using. In the Chronology section, we have every space launch for the year, with a list of payloads the launch carried. After a quick analysis of the columns in the dataset and some cleaning on the columns with useful information, we’ll create some visualizations and answer some questions.

Each of our rows of our dataset will have a “payload” (something that was put in orbit, hopefully, or was lost if the launch failed), in a launch. Each launch has a unique ID. Let’s see what other columns we have. (this dataset was partially cleaned for simplicity beforehand)

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
df = pd.read_csv("../input/spacelaunches/clean_launch_data.csv")
df.head()
nation type / application operator contractors equipment configuration propulsion power lifetime mass orbit date id vehicle site failed type, application
0 ussr technology NaN npo energia 2 transmitters pressurized sphere with four antennas none batteries 21 days 84 kg 228 km × 947 km, 65.0� 1957-10-04T03:00:00.000Z 1957 α (001) Sputnik (1) Ba LC-1/5 False NaN
1 ussr biological resaerch NaN NaN NaN r-7 core with added payload none (after burnout) batteries 6 days 508 kg (payload) 212 km × 1660 km, 65.3� 1957-11-03T03:00:00.000Z 1957 β (002) Sputnik (1) Ba LC-1/5 False NaN
2 usa science nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� 1957-12-06T03:00:00.000Z 1957-F01 Vanguard CC LC-18A True NaN
3 usa research, magnetosphere, micro meteorites nasa jet propulsion laboratory (jpl) cosmic-ray detection package, temperature sens... aerodynimcally shaped satellite body attached ... none batteries NaN 14 kg (#1-3), 17 kg (#4-5) 356 km × 2548 km, 33.24� (#1); 186 km × 2799 k... 1958-02-01T03:00:00.000Z 1958 α (001) Juno-1 CC LC-26A False NaN
4 usa science nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� 1958-02-05T03:00:00.000Z 1958-F01 Vanguard CC LC-18A True NaN

For each payload, the row indicates what country it belongs to (“nation”). But that’s not necessarily the country from where the payload was launched. That information is part of the “site” column. More information here: the vehicle that was used in the launch, and if the launch was successful.

Besides that, we know the type of the satellite, contractor, configuration, mass, the equipment the satellite carries, the propulsion method it uses, how it is powered, its expected lifetime, and mass. Many of the columns need some cleaning so we won’t concern ourselves a lot with them. Let’s see which of the columns will be most useful

df.describe()
nation type / application operator contractors equipment configuration propulsion power lifetime mass orbit date id vehicle site failed type, application
count 9304 9165 6994 8189 5217 6339 6891 9088 4237 7140 6768 9359 9359 9359 9359 9359 137
unique 176 506 1067 1072 1139 686 349 125 392 1383 1612 5247 5956 497 207 2 6
top usa communication gru tsskb ? cubesat (3u) ? solar cells, batteries ~ 1 year (iss orbit); 2-3 years (sso) ~260 kg geo 2017-02-15T03:00:00.000Z 2017-008 Soyuz-U CC SLC-40 False earth observation, traffic monitoring
freq 3753 1821 870 750 961 709 1729 2372 383 478 1219 104 104 796 575 8566 111

As we gathered from peeking at the first rows, most of the columns need more cleaning in order to be useful: there are many unique values (power, contractor, type of the satellite), so we won’t be able to make much sense of them without more work: there seem to be 349 different types of propulsion and power configurations, the “type / application column” seems to be indicated as “type , application” in some cases and as “type / application” in others.

Looking at the equipment column, it’s easy to tell it needs much more work. It won’t used for now. Instead, we’ll focus on columns regarding the launch to get a sense of long term trends in the access countries have been gaining to space, and leave the information about individual payloads for some other time.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9359 entries, 0 to 9358
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   nation              9304 non-null   object
 1   type / application  9165 non-null   object
 2   operator            6994 non-null   object
 3   contractors         8189 non-null   object
 4   equipment           5217 non-null   object
 5   configuration       6339 non-null   object
 6   propulsion          6891 non-null   object
 7   power               9088 non-null   object
 8   lifetime            4237 non-null   object
 9   mass                7140 non-null   object
 10  orbit               6768 non-null   object
 11  date                9359 non-null   object
 12  id                  9359 non-null   object
 13  vehicle             9359 non-null   object
 14  site                9359 non-null   object
 15  failed              9359 non-null   bool  
 16  type, application   137 non-null    object
dtypes: bool(1), object(16)
memory usage: 1.2+ MB

The nation column has some missing values. Not a lot (some 0.5% of all data), but that column will be useful later. There is no reasonable way of inferring the nation that owns the payload in rows that don’t have that information available, so we’ll complete with “Unknown”.

df["nation"] = df["nation"].fillna("unknown")
df["equipment"].dropna().value_counts().head(30)
#Best stay away from this for now!
?                                                                                                         961
ku/ka-band payload, optical inter-satellite link                                                          478
ps0, ps1 or ps2 imaging payload                                                                           383
ku-band payload                                                                                           129
see above                                                                                                 117
stratos (gps radio occultation payload), sense (ais-receiver), airsafe (asd-b payload, #78 and later)     111
package                                                                                                    96
ftor-2 camera package, kust-12m elint package                                                              81
ftor-4 camera package                                                                                      76
l-band payload, ka-band cross-links, ka-band downlinks, ads-b payload, ais-receiver (on 58 satellites)     75
ftor-6 camera package                                                                                      61
priroda-4 imaging system (2 × kfa-1000, 3 × kate-200)                                                      53
j-1 camera, index camera                                                                                   52
radar                                                                                                      37
tv, ir, ac                                                                                                 36
6 c-band transponders, 1 ku-band transponder                                                               35
3 delta-1 c-band transponders, 3 delta-2 c-band transponders                                               35
docking system                                                                                             30
zhemchug-4 camera                                                                                          30
24 c-band transponders                                                                                     30
1 (+1) uhf 200 w transponder, c-band uplink                                                                27
none                                                                                                       27
priroda-3                                                                                                  27
mural-camera, index camera                                                                                 26
ais-receiver                                                                                               25
camera                                                                                                     23
tv, ir, sm, rmk-2                                                                                          21
32 ku-band transponders                                                                                    21
j-3 camera, disic camera                                                                                   17
12 c-band transponders                                                                                     17
Name: equipment, dtype: int64

Let us start working with the date and the failed column. First, take a look at how many launches we had every year. Launches have been more frequent these last few years, and we’d expect to see that reflected in the data. We’ll be using the “failed” column to see how many launches were successful; the “site” column, that denotes the launch site for the launches (and indirectly, the country); and the nation column: what country the payload being put in orbit belongs to.

First, let’s extract the year and the month of each launch, and calculate the launch success rate each year. After some quick visualizations, both to get a sense of the data and to find some trends that seem evident, we’ll be in position to answer a few questions.

#Fixing the type / application column
df["type"] = df["type / application"].fillna("").astype(str)  +df["type, application"].fillna("").astype(str) 
df = df.drop(["type / application","type, application"], axis = 1)
df.head()

df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

First analysis - number of launches, and success rate

We’ll plot how many launches took place each year, regardless of how many payloads it was carrying, to see if the data makes sense. For that, since the dataset has a row for each payload, we’ll first grab the year and id column, and determine the number of separate space launches each year using groupby.

#on "id" we have the id of the launch. We'll only keep those two columns, and drop the duplicates to get the unique launches each year.
per_year_df = df[ ["year","id"] ].drop_duplicates()
#Now, we group by year, get the size, and with that plot the number of launches.
per_year_df.groupby("year").size().plot(kind="line",figsize=(16,4))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b66091810>

png

Interestingly enough, launches have indeed become more frequent lately, but are still far from the peak in the Cold War. If we plot the total number of launches each decade, the 70s and 80s are the most active. That information seems to make sense. We also see that for 2020 we have a drop in the number of launches, since the year is not done yet.

per_year_df.hist(bins=[1960,1970,1980,1990,2000,2010,2020])
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f8b65e92990>]],
      dtype=object)

png

Let’s take a closer look now at the success and the failures of the launches. For that, we’ll grab the id and failed column of our dataset, and add the number of items launched into space, using size and groupby.

#Failure Rate, another plot in the same showing failure trend
per_year_df = df[ ["year","id","failed"] ].drop_duplicates()
per_year_df.head()
year id failed
0 1957 1957 α (001) False
1 1957 1957 β (002) False
2 1957 1957-F01 True
3 1958 1958 α (001) False
4 1958 1958-F01 True

Now, plot them to see the percentage of successful launches each year. For that it will be necessary to re-arrange our data set, and add a success rate column.

failures_per_year = per_year_df.groupby(["year","failed"]).size().unstack().reset_index();
failures_per_year["success"] = (failures_per_year[False])
failures_per_year["failures"] = (failures_per_year[True])
failures_per_year["total_launches"] = (failures_per_year[True]+failures_per_year[False])
failures_per_year["success_rate"] = 100*failures_per_year[False]/failures_per_year["total_launches"]
failures_per_year = failures_per_year[ ["year","success","failures","total_launches","success_rate"] ]

#failures_per_year.tail(10)

success_year= pd.melt(failures_per_year,id_vars=["year"],value_vars=["total_launches","failures","success_rate"],var_name="variable")

#success_year.tail(10)

“Melt” is used to unpivot our data, turning “success”,”failures”, and “success_rate” columns to values of a single column, named variable, in order to use a grid and analyze each of these variables separately.

facets = sns.FacetGrid(success_year, col="variable",sharex=False,sharey=False)
facets.map(sns.lineplot,"year","value")
<seaborn.axisgrid.FacetGrid at 0x7f8b65e4b990>

png

As one would expect, success rate has increased a lot since the first few years, and is around 90-95% in these last few years. With our first global view done, let’s try and turn our attention to this same information, but looking at the countries launching these satellites, and taking a closer look at the payload being put into orbit.

Launch Sites - Cleaning, and joining

Let’s now take a look at our site information. It indicates the country where the launch site is located, it has a code, and a “details” column. “suo” indicates it’s a Suborbital Launch Site, and orb that it is an orbital launch site. The “t” at the end indicates the launch site (https://space.skyrocket.de/directories/launchsites.htm) The rest is probable left over html from web scrapping.

sites = pd.read_json("../input/spacelaunches/sites.json")
sites.head(25)
country raw details code name
0 Algeria Ha=  Centre interarmées d’essais d’engins spéc... suo Ha Centre interarmées d’essais d’engins spéciau...
1 Algeria Reg=  Reggane, Adrar Province suo Reg Reggane, Adrar Province
2 Antarctica DdU=  Dumont d'Urville Station, Terre Adelie † suot DdU Dumont d'Urville Station, Terre Adelie †
3 Antarctica Mat=  Base Aerea Teniente Benjamin Matienzo, N... suot Mat Base Aerea Teniente Benjamin Matienzo, Nunat...
4 Antarctica McM=  McMurdo Station, Ross Ice Shelf, New Zea... suot McM McMurdo Station, Ross Ice Shelf, New Zealand...
5 Antarctica Mol=  Molodezhnaya Station, Enderby Land, Aust... suo Mol Molodezhnaya Station, Enderby Land, Australi...
6 Antarctica Rot=  Rothera Research Station, British Antarc... suot Rot Rothera Research Station, British Antarctic ...
7 Antarctica Sip=  Siple Station, Ellsworth Land † suot Sip Siple Station, Ellsworth Land †
8 Antarctica Sho=  Showa Station (Syowa Station), East Ongu... suot Sho Showa Station (Syowa Station), East Ongul Is...
9 Antarctica VcM=  Vicecomodoro Marambio Station, Seymour I... suot VcM Vicecomodoro Marambio Station, Seymour Islan...
10 Antigua and Barbuda Ant=  Antigua Air Force Station, Antigua † suot Ant Antigua Air Force Station, Antigua †
11 Argentina PB=  Base Naval Puerto Belgrano, Punta Alta, C... orbp option PB Base Naval Puerto Belgrano, Punta Alta, Coro...
12 Argentina Cha=  Centro de Ensayo y Lanzamiento de Proyec... suo Cha Centro de Ensayo y Lanzamiento de Proyectile...
13 Argentina LPa=  Las Palmas, Chaco † suot LPa Las Palmas, Chaco †
14 Argentina MCh=  CELPA Atlantica (CELAT), Mar Chiquita, B... suo MCh CELPA Atlantica (CELAT), Mar Chiquita, Bueno...
15 Argentina PdI=  Puente del Inca, Mendoza † suot PdI Puente del Inca, Mendoza †
16 Argentina PIn=  Polo Espacial Punta Indio, Capetina, Bue... suot PIn Polo Espacial Punta Indio, Capetina, Buenos ...
17 Argentina Ser=  Poligono de Salinas Grandes, Serrezuela,... suot Ser Poligono de Salinas Grandes, Serrezuela, Cor...
18 Argentina Tar=  Tartagal, Salta † suot Tar Tartagal, Salta †
19 Argentina VR=  Villa Reynolds Air Base, Mercedes, San Lu... suot VR Villa Reynolds Air Base, Mercedes, San Luis ...
20 Australia AnP=  Anna Plains, Western Australia † suot AnP Anna Plains, Western Australia †
21 Australia Car=  Carnarvon Quobba Site, Western Australia † suot Car Carnarvon Quobba Site, Western Australia †
22 Australia CaN=  Carnarvon North Site, Western Australia † suot CaN Carnarvon North Site, Western Australia †
23 Australia Lan=  Lancelin, Perth, Western Australia † suot Lan Lancelin, Perth, Western Australia †
24 Australia Wo=  Woomera Instrumented Range, Woomera, Sout... orb Wo Woomera Instrumented Range, Woomera, South A...

The only useful information right now is in the country and the code, in order to join them to our launch dataset. We’ll keep those columns, and build a new dataframe with the launch data. Tweaking the site column in our main dataframe will be necessary since it includes the launch pad (Ba LC 1/5, for instance)

df["site_code"] = df["site"].apply(lambda x: x.split(" ")[0]).astype("string")
print("Has null values? ", len(df["site_code"])!=len(df["site_code"].dropna()))
sites.set_index("code",inplace=True)
Has null values?  False

In order to properly join both datasets, we need to see if the site launch codes specified in our launch data matches the site code.

#Check different values for the site codes.
launch_site_codes = set(df["site_code"].unique())
site_codes = set(sites.index.unique())
unmatched_codes = list(launch_site_codes-site_codes)
unmatched_codes
['',
 'WI,',
 'In,',
 '@BaS',
 'Va,',
 'SLC-36A',
 'CC,',
 'Kw,',
 'Mo,',
 'BaS',
 'Ed,',
 'CCK',
 'SLC-36B',
 'Ga,',
 '@YS']

Some are empty, some have commas, others seem to have capitalization issues, other include a @. Let’s define a function to replace our simple “split” operation then, since the cleaning operation won’t be as simple as we thought. We’ll apply that function to the site column, and check again for unmatched values.

def clean_site_code(raw_site):
    return re.sub("@|,|-.*","",raw_site.split(" ")[0]).strip()

def unmatched_codes(launch_site_codes,site_codes):
    unmatched_code_list = list(set(launch_site_codes)-set(site_codes))
    return unmatched_code_list
df["site_code"] = df["site"].apply(clean_site_code).astype("string")
unmatched = unmatched_codes(df["site_code"].unique(),sites.index.unique())
df[df["site_code"].isin(unmatched)]["site_code"].value_counts()
CCK    473
YS       7
BaS      5
SLC      4
         1
Name: site_code, dtype: Int64

Almost done, but more information is needed about that CCK site, since it accounts for almost all of our missing values. Also, it’s necessary to check for the rest of the missing values by visiting the page of the launch year.

#What payloads have incorrect launch sites (besides CCK)? With the year of the launch site, we'll revisit our source of the data.

df[ (df["site_code"].isin(unmatched)) & ( df["site_code"] != "CCK")][["year","site_code","site"]]
year site_code site
4897 1998 BaS BaS
4898 1998 BaS BaS
4899 1998 BaS BaS
5310 2002 LC-1/5
5480 2004 SLC SLC-36A
5484 2004 SLC SLC-36B
5495 2004 SLC SLC-36B
5517 2004 SLC SLC-36A
5573 2005 BaS @BaS
5648 2006 BaS @BaS K-84
8493 2019 YS @YS (Tai Rui)
8494 2019 YS @YS (Tai Rui)
8495 2019 YS @YS (Tai Rui)
8496 2019 YS @YS (Tai Rui)
8497 2019 YS @YS (Tai Rui)
8498 2019 YS @YS (Tai Rui)
8499 2019 YS @YS (Tai Rui)

I went back to explore the source of the data, and checked the years with problematic launch sites. (https://space.skyrocket.de/doc_chr/lau2019.htm, https://space.skyrocket.de/doc_chr/lau1998.htm, etc) . CCK seems to correspond to Cape Canaveral. Since the “CCK” is listed as a Launch Site different from “CC”, I’ll manually add it to our sites list. There are only a few sites, it’s reasonable to go over them one by one and review the other problematic launch codes. Remember the goal is to determine which country they belong to. “LC-1/5” and “SLC-x”, appear to be space lauch complexes in Cape Canaveral. BaS seems to be Ba, and YS is China, not completely sure where. We’ll add an entry for that site as well.

new_sites = [
    ["USA","NASA John F. Kennedy Space Center, Cape Canaveral, Florida, USA","orb","NASA John F. Kennedy Space Center, Cape Canaveral, Florida, USA"],
    ["China","Unknown, China","orb","Unknown, China"]
]
cck_ys = pd.DataFrame(new_sites
                   ,columns=["country","raw","details","name"],
                   index=["CCK","YS"])

sites = sites.append(cck_ys)

unmatched = unmatched_codes(df["site_code"].unique(),sites.index.unique())
df[df["site_code"].isin(unmatched)][["year","site_code"]]

#df.loc[df["site"]=="LC-1/5", "site_code" ] = "CC"
df.loc[df["site_code"]=="", "site_code" ] = "CC"
df.loc[df["site_code"]=="SLC", "site_code" ] = "CC"
df.loc[df["site_code"]=="BaS", "site_code" ] = "Ba"
#df[df["site_code"]=="SLC"]

print("Number of unmatched site codes: ", len(df[df["site_code"].isin(unmatched)][["year","site_code","site"]]))
Number of unmatched site codes:  0

0 unmatched site codes. Everything is in order to join both datasets.

joined = df.join(sites, on="site_code", lsuffix="site__")
joined.head()

nation operator contractors equipment configuration propulsion power lifetime mass orbit ... site failed type year month site_code country raw details name
0 ussr NaN npo energia 2 transmitters pressurized sphere with four antennas none batteries 21 days 84 kg 228 km × 947 km, 65.0� ... Ba LC-1/5 False technology 1957 10 Ba Kazakhstan Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
0 ussr NaN npo energia 2 transmitters pressurized sphere with four antennas none batteries 21 days 84 kg 228 km × 947 km, 65.0� ... Ba LC-1/5 False technology 1957 10 Ba USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
1 ussr NaN NaN NaN r-7 core with added payload none (after burnout) batteries 6 days 508 kg (payload) 212 km × 1660 km, 65.3� ... Ba LC-1/5 False biological resaerch 1957 11 Ba Kazakhstan Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
1 ussr NaN NaN NaN r-7 core with added payload none (after burnout) batteries 6 days 508 kg (payload) 212 km × 1660 km, 65.3� ... Ba LC-1/5 False biological resaerch 1957 11 Ba USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
2 usa nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� ... CC LC-18A True science 1957 12 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...

5 rows × 23 columns

There’s a bit of a situation here. Launchpads that are in what was the Soviet Union are listed twice apparently. Ideally, we’d like to consider the country as “USSR” if the launch was before it collapsed. We’ll need to determine when our data first registers “Russia” or another ex-USSR country, establish that as the “collapse_date” of the Soviet Union, and modify our sites dataset accordingly.

ussr_df = df[(df["nation"] == "russia") | (df["nation"] == "ussr")][ ["year","nation"] ]
grid = sns.FacetGrid(ussr_df,col="nation")
grid.map(plt.hist,"year")
<seaborn.axisgrid.FacetGrid at 0x7f8b6718ed50>

png

Apparently that won’t do. Some satellites are considered to be from “Russia” in the the 80s, and some are considered soviet well into the 90s. Remember, the data isn’t perfect. So we’ll say the launch sites belong to the USSR for dates before the dissolution of the Soviet Union. Another thing to do is finding other ambiguous site codes and fix them.

sites.reset_index().head()
#sites[sites["country"].str.contains("USSR") | sites["country"].str.contains("Uzb")]
sites.reset_index().groupby("index").filter(lambda x: len(x) > 1)

index country raw details name
85 Ba Kazakhstan Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
121 Sin North Korea Sin=  Sinpo Shipyard, South Hamgyong Province mis Sinpo Shipyard, South Hamgyong Province
123 Sin North Korea Sin=  Sinpo Shipyard, South Hamgyong Province mist Sinpo Shipyard, South Hamgyong Province
194 WI USA WI=  Wallops Flight Facility, Wallops Island, ... orb Wallops Flight Facility, Wallops Island, Vir...
195 WI USA WI=  Mid-Atlantic Regional Spaceport (MARS), W... orb Mid-Atlantic Regional Spaceport (MARS), Wall...
198 Ba USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
204 Kat USSR / Russia Kat=  Kattakurgan missile base, Samarkand regi... mis Kattakurgan missile base, Samarkand region, ...
228 Kat Uzbekistan Kat=  Kattakurgan missile base, Samarkand regi... mis Kattakurgan missile base, Samarkand region, ...

We found some other issues as well: There seems to be a duplicate value for North Korea (a temporary site, that became permanent later?), and a misscategorized one in the USA (Wallops Flight Facility).

sites.reset_index(inplace=True)
sites.head()

index country raw details name
0 Ha Algeria Ha=  Centre interarmées d’essais d’engins spéc... suo Centre interarmées d’essais d’engins spéciau...
1 Reg Algeria Reg=  Reggane, Adrar Province suo Reggane, Adrar Province
2 DdU Antarctica DdU=  Dumont d'Urville Station, Terre Adelie † suot Dumont d'Urville Station, Terre Adelie †
3 Mat Antarctica Mat=  Base Aerea Teniente Benjamin Matienzo, N... suot Base Aerea Teniente Benjamin Matienzo, Nunat...
4 McM Antarctica McM=  McMurdo Station, Ross Ice Shelf, New Zea... suot McMurdo Station, Ross Ice Shelf, New Zealand...
sites.at[123,"index"] = "Sin-2"
sites.at[195,"index"] = "W-MARS"


sites.groupby("index").filter(lambda x: len(x) > 1)
sites.set_index("index",inplace=True)
sites.head()

ambiguous_sites = sites[ sites["country"]=="USSR / Russia" ]

df.loc[ (df["year"]<1992) & (df["site_code"].isin(ambiguous_sites.index) ), "site_code" ] = df["site_code"] + "-USSR"

ambiguous_sites.index = ambiguous_sites.index + "-USSR"
sites = sites.append(ambiguous_sites)


sites.tail()
country raw details name
index
Vla-USSR USSR / Russia Vla=  Vladimirovka Poligon suo Vladimirovka Poligon
Vo-USSR USSR / Russia Vo=  Vostochniy, Amurskaya Oblast' orb Vostochniy, Amurskaya Oblast'
Vor-USSR USSR / Russia Vor=  Vorkuta Sovetskiy, Vorkuta, Komi Republic mis Vorkuta Sovetskiy, Vorkuta, Komi Republic
Ya-USSR USSR / Russia Ya=  Yasnaya missile base, Chita Oblast mis Yasnaya missile base, Chita Oblast
Yur-USSR USSR / Russia Yur=  Yurya missile site mis Yurya missile site

Let’s now join again, and plot the launches for each year, by country.

joined = df.join(sites, on="site_code")
#print("Joined dataframe length: ",len(joined))
joined.head(10)
nation operator contractors equipment configuration propulsion power lifetime mass orbit ... site failed type year month site_code country raw details name
0 ussr NaN npo energia 2 transmitters pressurized sphere with four antennas none batteries 21 days 84 kg 228 km × 947 km, 65.0� ... Ba LC-1/5 False technology 1957 10 Ba-USSR USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
1 ussr NaN NaN NaN r-7 core with added payload none (after burnout) batteries 6 days 508 kg (payload) 212 km × 1660 km, 65.3� ... Ba LC-1/5 False biological resaerch 1957 11 Ba-USSR USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
2 usa nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� ... CC LC-18A True science 1957 12 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
3 usa nasa jet propulsion laboratory (jpl) cosmic-ray detection package, temperature sens... aerodynimcally shaped satellite body attached ... none batteries NaN 14 kg (#1-3), 17 kg (#4-5) 356 km × 2548 km, 33.24� (#1); 186 km × 2799 k... ... CC LC-26A False research, magnetosphere, micro meteorites 1958 2 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
4 usa nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� ... CC LC-18A True science 1958 2 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
5 usa nasa jet propulsion laboratory (jpl) cosmic-ray detection package, temperature sens... aerodynimcally shaped satellite body attached ... none batteries NaN 14 kg (#1-3), 17 kg (#4-5) 356 km × 2548 km, 33.24� (#1); 186 km × 2799 k... ... CC LC-26A True research, magnetosphere, micro meteorites 1958 3 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
6 usa nasa naval research laboratory (nrl) NaN NaN none solar cells, batteries NaN 1.5 kg 654 km × 3969 km, 34.25� ... CC LC-18A False science 1958 3 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
7 usa nasa jet propulsion laboratory (jpl) cosmic-ray detection package, temperature sens... aerodynimcally shaped satellite body attached ... none batteries NaN 14 kg (#1-3), 17 kg (#4-5) 356 km × 2548 km, 33.24� (#1); 186 km × 2799 k... ... CC LC-5 False research, magnetosphere, micro meteorites 1958 3 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...
8 ussr NaN NaN see above NaN none batteries, solar cells ~ 1 month 1327 kg 217 km × 1864 km, 65.2� ... Ba LC-1/5 True scientific 1958 4 Ba-USSR USSR / Russia Ba=  Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam orb Baikonur (Tyuratam, NIIP-5, GIK-5), Tyuratam
9 usa nasa naval research laboratory (nrl) NaN NaN none batteries NaN 10 kg NaN ... CC LC-18A True science 1958 4 CC USA CC=  Cape Canaveral Air Force Station, Eastern... orb Cape Canaveral Air Force Station, Eastern Te...

10 rows × 23 columns

Exploring the country data

We’ll aggregate the data we have with the agg function, which is more flexible than .count(), or .sum(), and allows us to specify what to do with each column: Nation has the country the individual payload belongs to.

With that we can begin answering some questions, such as how many nations put payloads in space each year, how many countries have launch capabilities, and how many payloads are placed into space with each launch.

unique_launches = joined.groupby("id",as_index=False).agg({'nation': 'nunique', 
                                                           'type': 'nunique', 'year': 'first', 'month': 'first', 
                                                           'country': 'first',"site":"first", "power":"count","vehicle":"first","date":"first","failed":"first","equipment":"count"})

unique_launches.loc[ (unique_launches["year"]>=1992) & (unique_launches["country"]=="USSR / Russia" ), "country" ] = "Russia"

unique_launches = unique_launches.rename(columns={"power":"launches","equipment":"payload_total"})

agg_map = {"country":"nunique","type":"nunique","launches":"count","nation":"sum","vehicle":"nunique","failed":"sum","payload_total":"sum"}
countries_year = unique_launches.groupby("year",as_index=False).agg(agg_map)
successful_countries_year = unique_launches[unique_launches["failed"]==False].groupby("year",as_index=False).agg(agg_map)

countries_year["success_rate"] = 100 - 100*(countries_year["failed"] / countries_year["launches"])
countries_year["countries_per_launch"] = 100*(countries_year["country"] / countries_year["launches"])
countries_year["nations_per_launch"] = 100*(countries_year["nation"] / countries_year["launches"])

countries_year.head()
year country type launches nation vehicle failed payload_total success_rate countries_per_launch nations_per_launch
0 1957 2 1 3 3 2 1.0 1 66.666667 66.666667 100.0
1 1958 2 1 28 28 9 21.0 20 25.000000 7.142857 100.0
2 1959 2 1 25 25 11 12.0 16 52.000000 8.000000 100.0
3 1960 2 2 38 38 18 19.0 26 50.000000 5.263158 100.0
4 1961 2 3 50 50 15 21.0 37 58.000000 4.000000 100.0

Furthermore, we create a table with nation information summarized by year and month.

nations_year_month = joined.groupby(["year","month"],as_index=False).agg({'nation': 'nunique'})

Visualizations

Now that most of the data has been processed and is in a usable form, it is time to start visualizing some of it, using Seaborn. We are mainly concerned with the data summarized by year.

#Add color map.
from matplotlib.colors import ListedColormap
cmap = ListedColormap(sns.color_palette("RdYlGn",10))

Success rate

#Let's analize the error/success rate each year
fig = plt.figure(figsize=(20,6))

ax = sns.barplot(x="year",y="launches",data=countries_year,color="green")
sns.barplot(x="year",y="failed",data=countries_year,color="red",ax=ax)

#ax.legend(countries_year,["A","B"])
ax.set_xticklabels([year if year%5==0 else "" for year in countries_year["year"]])
ax.set_title("Number of space launches each year (successful and unsuccessful)", fontsize=20)
Text(0.5, 1.0, 'Number of space launches each year (successful and unsuccessful)')

png

Access to space

Now, to visualize how access to space has grown lately, let’s go back to our original table, group by month and year and count the different countries that launched to space in that month. After that, we’ll build a pivot table, and plot the different countries launching from their soil in a heatmap, and the number of nations puting satellites in orbit.

Let’s first take a look the nation column to see the different nations launching to space.

print(df["nation"].value_counts())
print(df["nation"].value_counts().tail(40))
usa                     3753
ussr                    1847
ussr / russia            904
china                    540
russia                   419
                        ... 
guatemala                  1
usa, uk, netherlands       1
japan, usa                 1
china → indonesia          1
costa rica                 1
Name: nation, Length: 177, dtype: int64
portugal                                 1
bangladesh                               1
israel, france                           1
usa, spain                               1
uk, netherlands                          1
russia, australia                        1
romania                                  1
usa, argentina                           1
ethiopia                                 1
international (eutelsat), qatar          1
usa / international                      1
france, italy                            1
russia → china                           1
greece, saudi arabia                     1
germany, uk, usa                         1
usa, europe                              1
rwanda                                   1
international → afghanistan              1
india, russia                            1
france, usa                              1
canada, usa                              1
germany, luxembourg                      1
netherlands, germany, belgium, israel    1
italy, iraq                              1
europe, russia, usa                      1
turkmenistan                             1
france, europe                           1
turkmenistan, monaco                     1
azerbaijan                               1
italy, france                            1
indonesia, germany                       1
azerbaijan, international                1
usa, canada                              1
usa → canada                             1
algeria, uk                              1
guatemala                                1
usa, uk, netherlands                     1
japan, usa                               1
china → indonesia                        1
costa rica                               1
Name: nation, dtype: int64

Ownership of satellites does not seem so simple. Some satellites belong to several countries, and after some research some seem to be transfered while in orbit (denoted by the arrow -> in the field). We’ll clean this column before ploting. Let’s see how many nations own a single satellite, and how many have been transfered.

#df[ df["nation"].fillna("unknown").str.contains("→") ].count()
multi_ownership = df[ df["nation"].str.count(",") > 0 ]
transferred = df[ df["nation"].str.count("→") > 0 ]

print("Satellites with multiple owners: ", len(multi_ownership) )
print("Transferred satellites: ", len(transferred) )


df["country_count"] = 1 + df["nation"].str.count(",")

fix, axes = plt.subplots(3,1,figsize=(10,10))

axes[0].set_title("Country count")
axes[1].set_title("Payloads with multi-ownership")
axes[2].set_title("Payloads with multi-ownership")

df.hist("country_count",ax=axes[0])
df[ df["country_count"] > 1 ].hist("country_count",ax=axes[1])
df[ df["country_count"] > 2 ].hist("country_count",ax=axes[2])
Satellites with multiple owners:  171
Transferred satellites:  12





array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f8b61437310>],
      dtype=object)

png

Edit: An earlier version didn’t take into account the different nations in this column.

Our goal is to get a sense of how many nations have reached space with a satellite, Many satellites are apparently owned by two nations, but after that. We’ll deal with this adding a “second_nation” column, leave the “nation” column for the first nation, and disregard the few cases with more than three owners. As for satellites that have been transfered, we’ll only consider the first nation at launch as the owner nation.

def clean_country(x):
    return x.strip();

def build_clean_map():
    words_to_remove = ["\d+","\(.*?\)","\(","\)","→.*"]

    clean_map = {word:"" for word in words_to_remove}
    return clean_map

clean_map = build_clean_map();

#joined["second_nation"] = joined["nation"].apply(lambda x: x.split(",")[1] if (x.count(",") > 0) else None)
#joined["first_nation"] = joined["nation"].apply(lambda x: x.split(",")[0]);

joined["second_nation"] = joined["nation"].apply(lambda x: x.strip()).replace(clean_map,regex=True).apply(lambda x: x.strip());
joined["first_nation"] = joined["second_nation"].apply(lambda x: x.split(",")[0])
joined["second_nation"] = joined["second_nation"].apply(lambda x: x.split(",")[1] if (x.count(",") > 0) else None)

nation_payloads_year = joined.groupby("year")["first_nation","second_nation"].apply(lambda x: len(pd.unique(x.values.ravel()).tolist()))

#joined["first_nation"].concat(joined["second_nation"]).unique()

#clean_country("monaco")
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:19: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
launches_per_month = unique_launches.groupby( ["year","month"] ).agg({"country":"count"}).reset_index()

countries_per_month = unique_launches.groupby( ["year","month"] ).agg({"country":"nunique"}).reset_index()
countries_per_month.head()
country_month_pivot_table = countries_per_month.pivot("month","year","country").fillna(0)

launches_per_month = unique_launches.groupby( ["year","month"] ).agg({"country":"count"}).reset_index()
launches_month_pivot_table = countries_per_month.pivot("month","year","country").fillna(0)

#We transform both columns, first nation and second nation into a single list, and count the unique values with len. pd.unique(x.values.ravel()) 
nation_payloads_year = joined.groupby("year")["first_nation","second_nation"].apply(lambda x: len(pd.unique(x.values.ravel()).tolist())).reset_index().rename({0:"nation"},axis=1)
#nation_payloads_year = joined.groupby(["year"],as_index=False).agg({'first_nation': 'nunique'})

fig,(ax,ax2) = plt.subplots(2,1,figsize=(16,8))
sns.heatmap(country_month_pivot_table,ax=ax, square=True,yticklabels=False)
sns.lineplot(x="year",y="nation",data=nation_payloads_year[ nation_payloads_year["year"]<2020 ],ax=ax2,legend="full")

ax.set_xticklabels([year if year%5==0 else "" for year in countries_year["year"]])
ax.set_title("Number of different countries launching to space each month",fontsize=20)
ax.set_yticklabels([0, 1,2,3,4,5,6,7,8])
ax.yaxis.set_ticks([ 0, 1,2,3,4,5,6,7,8])

ax2.set_title("Different nations putting payloads in orbit each year",fontsize=20)


/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:11: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  # This is added back by InteractiveShellApp.init_path()





Text(0.5, 1.0, 'Different nations putting payloads in orbit each year')

png

Let’s see how many different countries have launched from their territory, and how often:

country_launches = unique_launches.groupby("country").agg({"year":["min", "max"],"type":"count"}).reset_index()
country_launches.columns = ['-'.join(col) for col in country_launches.columns.values]
country_launches = country_launches.rename({"type-count":"launches","country-":"country","year-min":"first_launch_year","year-max":"last_launch_year"},axis=1)
country_launches["yearly_rate_since_first"] = country_launches["launches"] / (2020-country_launches["first_launch_year"])
country_launches.sort_values("first_launch_year",ascending=False)
country first_launch_year last_launch_year launches yearly_rate_since_first
13 New Zealand 2017 2020 12 4.000000
16 South Korea 2009 2013 3 0.272727
7 Iran 2008 2020 13 1.083333
12 Marshall Islands 2000 2012 9 0.450000
6 International 1999 2014 36 1.714286
14 North Korea 1998 2016 5 0.227273
2 Brazil 1997 2003 3 0.130435
17 Spain 1997 1997 1 0.043478
11 Kazakhstan 1992 2020 539 19.250000
15 Russia 1992 2020 277 9.892857
8 Israel 1988 2016 10 0.312500
5 India 1979 2019 74 1.804878
4 France 1970 2020 299 5.980000
3 China 1970 2020 361 7.220000
1 Australia 1967 1971 6 0.113208
9 Italy 1967 1988 9 0.169811
10 Japan 1966 2020 122 2.259259
0 Algeria 1965 1967 4 0.072727
18 USA 1957 2020 1720 27.301587
19 USSR / Russia 1957 1991 2453 38.936508

Most countries don’t seem to stick to their programs, or develop their acquired capabilities, since many countries in this list only have a handful of launches . Algeria stopped being a French colony in 1962, but maybe the launches in ‘65 were from a French site? That requires some research. What about the last time a country launched to space?

country_launches.sort_values("last_launch_year",ascending=False)
country first_launch_year last_launch_year launches yearly_rate_since_first
10 Japan 1966 2020 122 2.259259
3 China 1970 2020 361 7.220000
4 France 1970 2020 299 5.980000
7 Iran 2008 2020 13 1.083333
18 USA 1957 2020 1720 27.301587
11 Kazakhstan 1992 2020 539 19.250000
13 New Zealand 2017 2020 12 4.000000
15 Russia 1992 2020 277 9.892857
5 India 1979 2019 74 1.804878
8 Israel 1988 2016 10 0.312500
14 North Korea 1998 2016 5 0.227273
6 International 1999 2014 36 1.714286
16 South Korea 2009 2013 3 0.272727
12 Marshall Islands 2000 2012 9 0.450000
2 Brazil 1997 2003 3 0.130435
17 Spain 1997 1997 1 0.043478
19 USSR / Russia 1957 1991 2453 38.936508
9 Italy 1967 1988 9 0.169811
1 Australia 1967 1971 6 0.113208
0 Algeria 1965 1967 4 0.072727
#Pace of launches for selected countries
#
#country_and_month = unique_launches.groupby( ["country","year","month"] ).size().reset_index()
#
#countries = []
#
#country = country_and_month[ country_and_month["country"]=="China" ].pivot("year","month",0).fillna(0)
#
#sns.heatmap(country,cmap="coolwarm")



argentina = joined[ joined["nation"]=="argentina" ].groupby(["year","month"] ,as_index=False).agg({'nation': 'count'})

argentina.pivot("year","month","nation").fillna(0)
month 1 2 4 5 6 8 9 10 11 12
year
1990 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1996 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1997 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2007 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2013 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2014 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2015 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2016 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
2017 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2018 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2020 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Time to look at the data but grouping by country: let’s see what countries have launched to space in the different decades. We’ll draw several heatmaps, one for each quarter century, and see how the information has changed over time.

decade_tables = []
years = [1950,1975,2000]
fig, axes = plt.subplots(len(years),1,figsize=(20,20))

for i,year in enumerate(years):
    decade_group = unique_launches[ (unique_launches["year"]>=year) & (unique_launches["year"]<year+25) ].groupby(["country","year"])
    decade_pivot_table = decade_group.agg({"month":"count"}).reset_index().pivot("country","year","month")
    decade_tables.append( decade_pivot_table.fillna(0 ) )
    sns.heatmap(decade_pivot_table.fillna(0),ax=axes[i], cmap="coolwarm")                                                                     


png

We can see the trend we saw earlier reflected here as well: before the collapse of the Soviet Union, a great number of launches that died down. The number of launches from the US stays steady, and China really picks up the pace late this past decade. If the countries are grouped more conveniently, it’s possible to plot this in a line chart to see the countries with the most launches more clearly.

countries = unique_launches.groupby( ["year","country"] ).size().reset_index().rename({0:"launches"},axis=1)
top_countries = countries.groupby("country").mean().sort_values("launches", ascending=False).head(8).index

countries["is_top"] =  countries["country"].isin(top_countries)

other_countries = countries[ countries["is_top"]==False ].groupby("year").sum().reset_index()[ ["year","launches"] ]
other_countries["country"] = "Other"

all_countries = pd.concat( [ countries[ countries["country"].isin(top_countries) ][ ["year","launches","country"] ],other_countries] )

fig, axes = plt.subplots(2,1,figsize=(20,20))

axes[0].set_title("Number of launches by country",fontsize=30)
axes[1].set_title("Number of launches by country, 2010 onwards",fontsize=30)
sns.lineplot(x="year",y="launches",hue="country",data=all_countries,ax=axes[0], lw=3)
sns.lineplot(x="year",y="launches",hue="country",data=all_countries[ (all_countries["year"]>2010) & (all_countries["year"]<2020)] ,ax=axes[1], lw=3)


<matplotlib.axes._subplots.AxesSubplot at 0x7f8b60380a90>

png

Launch vehicles, number of satellites per launch

One last look at our data: what about the number of satellites being put to space? Let’s try visualizing how many satellites get to space each year. We’ll plot the number of launches, the year, and the total number of satellites each year, and also visualize the number of different space vehicles employed each year. That will give us some insight about how the private sector is beginning to form the space industry

fig, (ax) = plt.subplots(1,1,figsize=(20,6))
#sns.scatterplot(x="year",y="country", size="country", hue="nation", data=countries_year)
ax.set_title("Number of launches per year, with total number of payloads",fontsize=30)
sns.scatterplot(x="year",y="launches", size="payload_total", color="green", palette="coolwarm", data=countries_year,ax=ax, sizes=(1,600), marker="o")
fig.show()

png

A perhaps better way to visualize this is to plot the payload divided by the number of launches.

#Let's start plotting the number of launches we had each year, and how many different countries launched into space.
#We'll also plot how many different vehicles were used each year.
fig, (ax,ax2) = plt.subplots(2,1,figsize=(20,6))
#sns.scatterplot(x="year",y="country", size="country", hue="nation", data=countries_year)
ax.set_title("Number of satellites/payload put in orbit per year")
successful_countries_year["payload_per_launch_avg"] = successful_countries_year["payload_total"] / successful_countries_year["launches"]

chart = sns.barplot(x="year",y="payload_total", data=successful_countries_year,ax=ax,color="#8080DA")
ax.set_xticklabels([year if year%5==0 else "" for year in countries_year["year"]])

ax2.set_title("Average number of satellites/payloads per launch")
chart = sns.barplot(x="year",y="payload_per_launch_avg", data=successful_countries_year,ax=ax2,color="#8080DA")
ax2.set_xticklabels([year if year%5==0 else "" for year in countries_year["year"]])

fig.show()

png

There was some increase before the introduction of Starlink by SpaceX, but so far in 2020, the difference has been staggering.

What we’ve learned, and what’s next

After putting our data in order, we extracted a few interesting insights:

  • We learned that peak of launches happened in the Cold War, driven mainly by the then USSR.
  • Launches have increased in pace this past decade, with China specially active.
  • Not many other countries have acquired launch capabilities. Only Iran, South Korea, and New Zealand, since the year 2000. (The Marshall islands launch site belongs to the United States).
  • Many countries are unable to sustain the pace of launches once they get to orbit. However, more research and domain knowledge is needed there: after further research, the 1997 launch from Spain consisted in launching a rocket from an aircraft, that took off from Gran Canaria. There might be some other cases similar to that one.
  • However, the number of different countries placing payloads in orbit increased markedly, specially in these past 15 years. The last few years saw 50 different countries consistently putting at least a satellite in orbit.

What would we like to find out next? There are many interesting questions that can answer with this data, if we properly handle the columns that were disregarded. Furthermore, after our initial dive, some other questions arise:

  • What’s the use of the satellites currently in space? (military, techonology)
  • What’s the average mass of the payloads in orbit? Satellites supposedly got smaller these last few years.
  • What about private companies? Can we tell when they started launching into space?
  • What’s the most reliable launch vehicle?
  • It would be nice to visualize some of this information in a map.
  • Do we have any information about humans in space in our data?
  • What about the launch vehicles employed? What are the different methods of launching payloads to space?