Using json_normalize function
A tutorial with examples on flattening JSON object using json_normalize pandas function
Javascript Object Notation(JSON) is a widely used format for storing and exchanging data. Coming from the relational database, it could be difficult to understand NoSQL databases that use JSON to store data and similarly REST API's response. JSON is also used in storing football event data. It allows easy addition of features in the future.
Though JSON format allows for easier exchange of data, for analysis, a tabular form would be appropriate. A JSON structure can be of two forms: a JSON object and list of JSON objects. Since our programming language of choice is Python, those structures can be somewhat called as a dictionary object or list of dicts.
import pandas as pd
viv = {
"player_id" : 15623,
"player_name" : "Vivianne Miedema",
"jersey_number" : 11}
viv
We use the json_normalize
API to flatten a JSON dict.
df = pd.json_normalize(viv);df
df.info()
Side Note:If the data contains something that is not compatible with python, in this case a
null
variable, there are two choices:
- Change
null
toNone
- Pass the data through
json.loads
function
Change null
to None
null = None
viv1 = { "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}
viv1
Make data as string and pass to json.loads
import json
viv1 = '{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}'
viv1 = json.loads(viv1)
viv1
player_list = [
{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
{ "player_id" : 10658, "player_name" : "Danielle van de Donk", "jersey_number" : 7, "player_nickname" : null }
]
pd.json_normalize(player_list)
We have the JSON list of dicts in a tabular form. All the keys become columns and their values as entries.
When we flattern a list with a key-value pair missing for an entry, instead of an error, NaN(not a number) is stored.
player_list = [
{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
{ "player_id" : 10658, "player_name" : "Danielle van de Donk"}
]
pd.json_normalize(player_list)
Note: See how player_nickname
when not specified also turns to NaN from None.
at_kick0ff = {
"id":"d712fb93-c464-4621-98ba-f2bdcd5641db",
"timestamp":"00:00:00.000",
"duration":0.0,
"lineup":{
"player":{
"id":15623,
"name":"Vivianne Miedema"
},
"position":{
"id":23,
"name":"Center Forward"
},
"jersey_number":11
}
}
at_kick0ff
pd.json_normalize(at_kick0ff)
You can see that lineup dictionary key's nested key-value pairs have been expanded into individual columns. If you feel that is unnecessary, we can restrict expansion by using max_level argument. With max_level=1
, the flattening goes one level deeper.
pd.json_normalize(at_kick0ff, max_level=1)
first_pass = [
{
"id":"15758edb-58cd-49c4-a817-d2ef48ba3bcf",
"timestamp":"00:00:00.504",
"type":{
"id":30,
"name":"Pass"
},
"play_pattern":{
"id":9,
"name":"From Kick Off"
},
"player":{
"id":15623,
"name":"Vivianne Miedema"
},
"pass":{
"recipient":{
"id":10666,
"name":"Dominique Johanna Anna Bloodworth"
},
"length":25.455845,
"angle":-2.3561945,
"height":{
"id":1,
"name":"Ground Pass"
},
"end_location":[
42.0,
22.0
]
}
}, {
"id" : "ab5674a4-e824-4143-9f6f-3f1645557413",
"timestamp" : "00:00:04.201",
"type" : {
"id" : 30,
"name" : "Pass"
},
"play_pattern" : {
"id" : 9,
"name" : "From Kick Off"
},
"player" : {
"id" : 10666,
"name" : "Dominique Johanna Anna Bloodworth"
},
"location" : [ 45.0, 29.0 ],
"duration" : 1.795201,
"pass" : {
"length" : 51.62364,
"angle" : 0.55038595,
"height" : {
"id" : 3,
"name" : "High Pass"
},
"end_location" : [ 89.0, 56.0 ]
}
}
]
pd.json_normalize(first_pass)
Limiting the levels...
pd.json_normalize(first_pass, max_level=0)
awfc = {
'team': 'AWFC',
'location': 'London',
'ranking': 1,
'info': {
'manager': 'Joe',
'contacts': {
'email': {
'coaching': 'joe@afc.com',
'general': 'info@afc.com'
},
'tel': '123456789',
}
},
'players': [
{ 'name': 'Viv' },
{ 'name': 'DvD' },
{ 'name': 'Kim' }
],
};awfc
The players column has a list of dicts. So, we can flatten that column using record_path argument.
pd.json_normalize(awfc, record_path=['players'])
But, making a separate table with no reference id has no meaning. To prevent that we can append revelant columns to the new table using meta argument. Here we want their team and Telephone number. The tel key lies within info->contacts->tel
. So, we need provide that path like so ['info', 'contacts', 'tel']
.
pd.json_normalize(awfc, record_path=['players'], meta=['team',['info', 'contacts', 'tel']])
The order in which those paths are mentioned, the order in which those columns are appended.
pd.json_normalize(awfc, record_path=['players'], meta=['team',['info', 'contacts', 'tel'],['info', 'manager']])
json_list = [
{
'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},
'players': [
{
'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{
'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{
'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'physio': 'aaaa',
'doctor': 'bbbb'
}
},
'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
},
]
pd.json_normalize(json_list)
pd.json_normalize(json_list, record_path =['players'])
How about we now append the players' team, colour, and their physio.
pd.json_normalize(
json_list,
record_path =['players'],
meta=['team', 'colour', ['info', 'staff', 'physio']]
)
json_list = [
{
'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},
'players': [
{
'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{
'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{
'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'doctor': 'bbbb'
}
},
'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
},
]
Notice that the key physio is missing from the entry team=city
. What happens if we try to access physio key inside meta
?
pd.json_normalize(
json_list,
record_path =['players'],
meta=['team', 'colour', ['info', 'staff', 'physio']],
)
How come stats.goals
and stats.assists
didn't generate an error but that above does? Because, the meta argument expects values to be present for listed keys in meta by default. We can ignore those errors(as suggested) using errors='ignore'
pd.json_normalize(
json_list,
record_path =['players'],
meta=['team', 'colour', ['info', 'staff', 'physio']],
errors='ignore'
)
Tip:Usually an
underscore
is used instead of.
json_list = [
{
'team': 'arsenal',
'colour': 'red-white',
'info': {
'staff': {
'physio': 'xxxx',
'doctor': 'yyyy'
}
},
'players': [
{
'name': 'Viv',
'sex': 'F',
'stats': { 'goals': 101, 'assists': 40 }
},
{
'name': 'Beth',
'sex': 'F',
'stats': { 'goals': 60, 'assists': 25 }
},
]
},
{
'team': 'city',
'colour': 'blue',
'info': {
'staff': {
'physio': 'aaaa',
'doctor': 'bbbb'
}
},
'players': [
{ 'name': 'Steph', 'sex': 'F' },
{ 'name': 'Lucy', 'sex': 'F' },
]
},
]
pd.json_normalize(
json_list,
record_path =['players'],
meta=['team', 'colour', ['info', 'staff', 'physio']],
sep='->'
)
pd.json_normalize(
json_list,
record_path=['players'],
meta=['team', 'colour', ['info', 'staff', 'physio']],
meta_prefix='meta-',
record_prefix='player-',
sep='->'
)
In most scenarios, we won't be making new JSON object ourselves instead use JSON formatted files. We make use python's json
module and read the file, then use pandas' json_normalize
to flatten it into a dataframe.
import json
# load data using Python JSON module
with open('my_attachment/movies.json') as f:
data = json.load(f)
# Normalizing data
pd.json_normalize(data)
Reading a JSON file from an url needs an extra module in requests
as any data from the Internet carries overheads that are necessary for efficient exchange of information(REST API). So, in order to read the file contents, we call upon requests' text
attribute which fetches the contents of the file.
Here, we use json.loads
and not json.load
as loads
function expects contents(string) rather than a file pointer. If looked closely into the json
module, the load
calls loads
using read()
on the file.
import requests
URL = 'https://vega.github.io/vega-datasets/data/cars.json'
data = json.loads(requests.get(URL).text)
pd.json_normalize(data)