JavaScript Object Notation (JSON) is a widely used format for storing and exchanging data. Coming from the world of relational databases, it can take some effort to get used to NoSQL databases that store data as JSON, and likewise to REST API responses in the same format. JSON is also used to store football event data, since it makes it easy to add new features in the future.
Although JSON makes data exchange easier, a tabular form is more convenient for analysis. A JSON structure comes in two forms: a single JSON object or a list of JSON objects. Since our programming language of choice is Python, these correspond to a dictionary (dict) and a list of dicts.
Importing the pandas library:
import pandas as pd
1 Flattening a simple JSON
A dict
Let us consider a simple dictionary: 3 keys and their respective values.
viv = {
"player_id" : 15623,
"player_name" : "Vivianne Miedema",
"jersey_number" : 11}
viv
{'player_id': 15623, 'player_name': 'Vivianne Miedema', 'jersey_number': 11}
We use the json_normalize API to flatten a JSON dict.
df = pd.json_normalize(viv); df
| | player_id | player_name | jersey_number |
---|---|---|---|
0 | 15623 | Vivianne Miedema | 11 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 player_id 1 non-null int64
1 player_name 1 non-null object
2 jersey_number 1 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
Side Note: If the data contains something that is not valid Python, in this case a null value, there are two choices (a short check after this list shows why the raw literal fails):
- Change null to None
- Pass the data, as a string, through the json.loads function
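To see why the raw JSON null trips Python up, here is a tiny hedged check (not part of the original notebook):
try:
    null                # JSON's null is not a Python name...
except NameError as err:
    print(err)          # ...so evaluating it raises: name 'null' is not defined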
Change null to None
null = None
viv1 = { "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}
viv1
{'player_id': 15623,
'player_name': 'Vivianne Miedema',
'jersey_number': 11,
'player_nickname': None}
Make the data a string and pass it to json.loads
import json
viv1 = '{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}'
viv1 = json.loads(viv1)
viv1
{'player_id': 15623,
'player_name': 'Vivianne Miedema',
'jersey_number': 11,
'player_nickname': None}
1.1 A list of dicts
player_list = [
{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
{ "player_id" : 10658, "player_name" : "Danielle van de Donk", "jersey_number" : 7, "player_nickname" : null }
]
pd.json_normalize(player_list)
| | player_id | player_name | jersey_number | player_nickname |
---|---|---|---|---|
0 | 15623 | Vivianne Miedema | 11 | None |
1 | 10658 | Danielle van de Donk | 7 | None |
We now have the JSON list of dicts in tabular form: all the keys become columns and their values become the row entries.
When we flatten a list where a key-value pair is missing from an entry, NaN (not a number) is stored instead of raising an error.
player_list = [
{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
{ "player_id" : 10658, "player_name" : "Danielle van de Donk"}
]
pd.json_normalize(player_list)
| | player_id | player_name | jersey_number | player_nickname |
---|---|---|---|---|
0 | 15623 | Vivianne Miedema | 11.0 | NaN |
1 | 10658 | Danielle van de Donk | NaN | NaN |
Note: See how player_nickname, when not specified, also becomes NaN rather than None.
2 Flattening a multi-level JSON
2.1 A simple dict
at_kick0ff = {
"id":"d712fb93-c464-4621-98ba-f2bdcd5641db",
"timestamp":"00:00:00.000",
"duration":0.0,
"lineup":{
"player":{
"id":15623,
"name":"Vivianne Miedema"
},"position":{
"id":23,
"name":"Center Forward"
},"jersey_number":11
}
}
at_kick0ff
{'id': 'd712fb93-c464-4621-98ba-f2bdcd5641db',
'timestamp': '00:00:00.000',
'duration': 0.0,
'lineup': {'player': {'id': 15623, 'name': 'Vivianne Miedema'},
'position': {'id': 23, 'name': 'Center Forward'},
'jersey_number': 11}}
pd.json_normalize(at_kick0ff)
| | id | timestamp | duration | lineup.player.id | lineup.player.name | lineup.position.id | lineup.position.name | lineup.jersey_number |
---|---|---|---|---|---|---|---|---|
0 | d712fb93-c464-4621-98ba-f2bdcd5641db | 00:00:00.000 | 0.0 | 15623 | Vivianne Miedema | 23 | Center Forward | 11 |
You can see that the nested key-value pairs under the lineup key have been expanded into individual columns. If that feels unnecessary, we can restrict the expansion using the max_level argument. With max_level=1, flattening stops after one level of nesting, so deeper structures are kept as dict values.
pd.json_normalize(at_kick0ff, max_level=1)
| | id | timestamp | duration | lineup.player | lineup.position | lineup.jersey_number |
---|---|---|---|---|---|---|
0 | d712fb93-c464-4621-98ba-f2bdcd5641db | 00:00:00.000 | 0.0 | {'id': 15623, 'name': 'Vivianne Miedema'} | {'id': 23, 'name': 'Center Forward'} | 11 |
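If you later want to expand one of those retained dict columns, one option is to run json_normalize again on just that column. A minimal sketch, using a hypothetical variable df_partial for the partially flattened result:
# keep the partially flattened frame around (hypothetical name, not in the original)
df_partial = pd.json_normalize(at_kick0ff, max_level=1)
# flatten only the retained 'lineup.player' dicts into their own dataframe
pd.json_normalize(df_partial['lineup.player'].tolist())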
2.2 A list of dicts
first_pass = [
{"id":"15758edb-58cd-49c4-a817-d2ef48ba3bcf",
"timestamp":"00:00:00.504",
"type":{
"id":30,
"name":"Pass"
},"play_pattern":{
"id":9,
"name":"From Kick Off"
},"player":{
"id":15623,
"name":"Vivianne Miedema"
},"pass":{
"recipient":{
"id":10666,
"name":"Dominique Johanna Anna Bloodworth"
},"length":25.455845,
"angle":-2.3561945,
"height":{
"id":1,
"name":"Ground Pass"
},"end_location":[
42.0,
22.0
]
}
}, {"id" : "ab5674a4-e824-4143-9f6f-3f1645557413",
"timestamp" : "00:00:04.201",
"type" : {
"id" : 30,
"name" : "Pass"
},"play_pattern" : {
"id" : 9,
"name" : "From Kick Off"
},"player" : {
"id" : 10666,
"name" : "Dominique Johanna Anna Bloodworth"
},"location" : [ 45.0, 29.0 ],
"duration" : 1.795201,
"pass" : {
"length" : 51.62364,
"angle" : 0.55038595,
"height" : {
"id" : 3,
"name" : "High Pass"
},"end_location" : [ 89.0, 56.0 ]
}
}
]
pd.json_normalize(first_pass)
| | id | timestamp | type.id | type.name | play_pattern.id | play_pattern.name | player.id | player.name | pass.recipient.id | pass.recipient.name | pass.length | pass.angle | pass.height.id | pass.height.name | pass.end_location | location | duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15758edb-58cd-49c4-a817-d2ef48ba3bcf | 00:00:00.504 | 30 | Pass | 9 | From Kick Off | 15623 | Vivianne Miedema | 10666.0 | Dominique Johanna Anna Bloodworth | 25.455845 | -2.356194 | 1 | Ground Pass | [42.0, 22.0] | NaN | NaN |
1 | ab5674a4-e824-4143-9f6f-3f1645557413 | 00:00:04.201 | 30 | Pass | 9 | From Kick Off | 10666 | Dominique Johanna Anna Bloodworth | NaN | NaN | 51.623640 | 0.550386 | 3 | High Pass | [89.0, 56.0] | [45.0, 29.0] | 1.795201 |
Limiting the levels…
pd.json_normalize(first_pass, max_level=0)
| | id | timestamp | type | play_pattern | player | pass | location | duration |
---|---|---|---|---|---|---|---|---|
0 | 15758edb-58cd-49c4-a817-d2ef48ba3bcf | 00:00:00.504 | {'id': 30, 'name': 'Pass'} | {'id': 9, 'name': 'From Kick Off'} | {'id': 15623, 'name': 'Vivianne Miedema'} | {'recipient': {'id': 10666, 'name': 'Dominique... | NaN | NaN |
1 | ab5674a4-e824-4143-9f6f-3f1645557413 | 00:00:04.201 | {'id': 30, 'name': 'Pass'} | {'id': 9, 'name': 'From Kick Off'} | {'id': 10666, 'name': 'Dominique Johanna Anna ... | {'length': 51.62364, 'angle': 0.55038595, 'hei... | [45.0, 29.0] | 1.795201 |
3 Flattening a JSON nested list
3.1 A simple dict
For this case, let us consider a simpler example than the football event data. The key players holds a list of dictionaries inside the structure, and the key info holds further nested dictionaries; this nested list is what we want to flatten.
awfc = {
'team': 'AWFC',
'location': 'London',
'ranking': 1,
'info': {
'manager': 'Joe',
'contacts': {
'email': {
'coaching': 'joe@afc.com',
'general': 'info@afc.com'
},'tel': '123456789',
}
},'players': [
{ 'name': 'Viv' },
{ 'name': 'DvD' },
{ 'name': 'Kim' }
],
}
awfc
{'team': 'AWFC',
'location': 'London',
'ranking': 1,
'info': {'manager': 'Joe',
'contacts': {'email': {'coaching': 'joe@afc.com', 'general': 'info@afc.com'},
'tel': '123456789'}},
'players': [{'name': 'Viv'}, {'name': 'DvD'}, {'name': 'Kim'}]}
The players key holds a list of dicts, so we can flatten it into its own rows using the record_path argument.
pd.json_normalize(awfc, record_path=['players'])
| | name |
---|---|
0 | Viv |
1 | DvD |
2 | Kim |
But making a separate table with no reference back to the parent record has little meaning. To prevent that, we can append relevant columns to the new table using the meta argument. Here we want each player's team and telephone number. The tel key lies within info->contacts->tel, so we need to provide that path as ['info', 'contacts', 'tel'].
pd.json_normalize(awfc, record_path=['players'], meta=['team', ['info', 'contacts', 'tel']])
| | name | team | info.contacts.tel |
---|---|---|---|
0 | Viv | AWFC | 123456789 |
1 | DvD | AWFC | 123456789 |
2 | Kim | AWFC | 123456789 |
The columns are appended in the same order in which their paths are listed in meta.
pd.json_normalize(awfc, record_path=['players'], meta=['team', ['info', 'contacts', 'tel'], ['info', 'manager']])
| | name | team | info.contacts.tel | info.manager |
---|---|---|---|---|
0 | Viv | AWFC | 123456789 | Joe |
1 | DvD | AWFC | 123456789 | Joe |
2 | Kim | AWFC | 123456789 | Joe |
3.2 A list of dicts
json_list = [
{ 'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},'players': [
{ 'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{ 'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{ 'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'physio': 'aaaa',
'doctor': 'bbbb'
}
},'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
},
]
pd.json_normalize(json_list)
| | team | colour | players | info.staff.physio | info.staff.doctor |
---|---|---|---|---|---|
0 | arsenal | red-white | [{'name': 'Viv', 'sex': 'F', 'stats': {'goals'... | xxxx | yyyy |
1 | city | blue | [{'name': 'Steph', 'sex': 'F'}, {'name': 'Lucy... | aaaa | bbbb |
pd.json_normalize(json_list, record_path=['players'])
| | name | sex | stats.goals | stats.assists |
---|---|---|---|---|
0 | Viv | F | 101.0 | 40.0 |
1 | Beth | F | 60.0 | 25.0 |
2 | Steph | F | NaN | NaN |
3 | Lucy | F | NaN | NaN |
How about we now append the players' team, colour, and physio?
pd.json_normalize(
    json_list,
    record_path=['players'],
    meta=['team', 'colour', ['info', 'staff', 'physio']]
)
| | name | sex | stats.goals | stats.assists | team | colour | info.staff.physio |
---|---|---|---|---|---|---|---|
0 | Viv | F | 101.0 | 40.0 | arsenal | red-white | xxxx |
1 | Beth | F | 60.0 | 25.0 | arsenal | red-white | xxxx |
2 | Steph | F | NaN | NaN | city | blue | aaaa |
3 | Lucy | F | NaN | NaN | city | blue | aaaa |
4 Ignoring key errors
json_list = [
{ 'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},'players': [
{ 'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{ 'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{ 'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'doctor': 'bbbb'
}
},'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
}, ]
Notice that the key physio is missing from the entry with team=city. What happens if we try to access the physio key inside meta?
pd.json_normalize(
    json_list,
    record_path=['players'],
    meta=['team', 'colour', ['info', 'staff', 'physio']],
)
KeyError: "Key 'physio' not found. To replace missing values of 'physio' with np.nan, pass in errors='ignore'"
How come stats.goals and stats.assists didn't generate an error, but the call above does? Because, by default, the meta argument expects a value to be present for every listed key, whereas missing keys inside the record_path entries simply become NaN. We can ignore those errors (as the message suggests) using errors='ignore'.
pd.json_normalize(
    json_list,
    record_path=['players'],
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    errors='ignore'
)
| | name | sex | stats.goals | stats.assists | team | colour | info.staff.physio |
---|---|---|---|---|---|---|---|
0 | Viv | F | 101.0 | 40.0 | arsenal | red-white | xxxx |
1 | Beth | F | 60.0 | 25.0 | arsenal | red-white | xxxx |
2 | Steph | F | NaN | NaN | city | blue | NaN |
3 | Lucy | F | NaN | NaN | city | blue | NaN |
5 Custom separator sep
We notice that by default pandas uses . to separate the levels of the path. We can change that using the sep argument.
Tip: Usually an underscore (_) is used instead of .
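For instance, a quick illustration of that tip on a throwaway dict (a hedged sketch, not part of the original example):
# with sep='_' the nested path becomes a single column named 'info_staff_physio'
pd.json_normalize({'info': {'staff': {'physio': 'xxxx'}}}, sep='_')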
json_list = [
{ 'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},'players': [
{ 'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{ 'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{ 'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'physio': 'aaaa',
'doctor': 'bbbb'
}
},'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
}, ]
pd.json_normalize(
    json_list,
    record_path=['players'],
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    sep='->'
)
| | name | sex | stats->goals | stats->assists | team | colour | info->staff->physio |
---|---|---|---|---|---|---|---|
0 | Viv | F | 101.0 | 40.0 | arsenal | red-white | xxxx |
1 | Beth | F | 60.0 | 25.0 | arsenal | red-white | xxxx |
2 | Steph | F | NaN | NaN | city | blue | aaaa |
3 | Lucy | F | NaN | NaN | city | blue | aaaa |
6 Adding context to record and meta data using record_prefix and meta_prefix
pd.json_normalize(
    json_list,
    record_path=['players'],
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    meta_prefix='meta-',
    record_prefix='player-',
    sep='->'
)
| | player-name | player-sex | player-stats->goals | player-stats->assists | meta-team | meta-colour | meta-info->staff->physio |
---|---|---|---|---|---|---|---|
0 | Viv | F | 101.0 | 40.0 | arsenal | red-white | xxxx |
1 | Beth | F | 60.0 | 25.0 | arsenal | red-white | xxxx |
2 | Steph | F | NaN | NaN | city | blue | aaaa |
3 | Lucy | F | NaN | NaN | city | blue | aaaa |
7 Working with a local file
In most scenarios, we won't be creating new JSON objects ourselves; instead, we will be working with JSON-formatted files. We use Python's json module to read the file, then pandas' json_normalize to flatten it into a dataframe.
import json
# load data using Python JSON module
with open('movies.json') as f:
    data = json.load(f)
# Normalizing data
pd.json_normalize(data)
| | Title | US Gross | Worldwide Gross | US DVD Sales | Production Budget | Release Date | MPAA Rating | Running Time min | Distributor | Source | Major Genre | Creative Type | Director | Rotten Tomatoes Rating | IMDB Rating | IMDB Votes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Land Girls | 146083 | 146083 | NaN | 8000000 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
1 | First Love, Last Rites | 10876 | 10876 | NaN | 300000 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
2 | I Married a Strange Person | 203134 | 203134 | NaN | 250000 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
3 | Four Rooms | 4301000 | 4301000 | NaN | 4000000 | Dec 25 1995 | R | NaN | Miramax | Original Screenplay | Comedy | Contemporary Fiction | Robert Rodriguez | 14.0 | 6.4 | 34328.0 |
4 | The Four Seasons | 42488161 | 42488161 | NaN | 6500000 | May 22 1981 | None | NaN | Universal | Original Screenplay | Comedy | Contemporary Fiction | Alan Alda | 71.0 | 7.0 | 1814.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | Big Things | 0 | 0 | NaN | 50000 | Dec 31 2009 | None | NaN | None | None | None | None | None | NaN | NaN | NaN |
64 | Bogus | 4357406 | 4357406 | NaN | 32000000 | Sep 06 1996 | PG | NaN | Warner Bros. | Original Screenplay | Comedy | Fantasy | Norman Jewison | 40.0 | 4.8 | 2742.0 |
65 | Beverly Hills Cop | 234760478 | 316300000 | NaN | 15000000 | Dec 05 1984 | None | NaN | Paramount Pictures | Original Screenplay | Action | Contemporary Fiction | Martin Brest | 83.0 | 7.3 | 45065.0 |
66 | Beverly Hills Cop II | 153665036 | 276665036 | NaN | 20000000 | May 20 1987 | R | NaN | Paramount Pictures | Original Screenplay | Action | Contemporary Fiction | Tony Scott | 46.0 | 6.1 | 29712.0 |
67 | Beverly Hills Cop III | 42586861 | 119180938 | NaN | 50000000 | May 25 1994 | R | NaN | Paramount Pictures | Original Screenplay | Action | Contemporary Fiction | John Landis | 10.0 | 5.0 | 21199.0 |
68 rows × 16 columns
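As a side note, pandas can also read a JSON file directly with read_json. The sketch below is only a hedged alternative: read_json does not flatten nested records the way json_normalize does, which is why the json module is paired with json_normalize above.
# alternative sketch: read the same file straight into a dataframe (no flattening of nested dicts)
movies = pd.read_json('movies.json')
movies.head()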
8 Working with a URL
Reading a JSON file from a URL needs an extra module, requests, because data coming over the Internet carries the overheads needed for efficient exchange of information (as with a REST API). To get at the file contents, we call the response's text attribute, which returns the body as a string.
Here, we use json.loads and not json.load, because the loads function expects contents (a string) rather than a file object. If you look closely at the json module, load simply calls loads on the result of read() on the file.
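A minimal sketch of that equivalence, reusing the movies.json file from the previous section:
# json.load(f) and json.loads(f.read()) parse the same file identically
with open('movies.json') as f:
    data_a = json.load(f)
with open('movies.json') as f:
    data_b = json.loads(f.read())
assert data_a == data_b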
import requests
URL = 'https://vega.github.io/vega-datasets/data/cars.json'
data = json.loads(requests.get(URL).text)
pd.json_normalize(data)
| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
---|---|---|---|---|---|---|---|---|---|
0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
401 | ford mustang gl | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 1982-01-01 | USA |
402 | vw pickup | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 1982-01-01 | Europe |
403 | dodge rampage | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 1982-01-01 | USA |
404 | ford ranger | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 1982-01-01 | USA |
405 | chevy s-10 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 1982-01-01 | USA |
406 rows × 9 columns
9 Conclusion
We saw the use of the json_normalize function from the pandas library. It takes JSON data, flattens it, and turns it into a dataframe for easier analysis.