Javascript Object Notation(JSON) is a widely used format for storing and exchanging data. Coming from the relational database, it could be difficult to understand NoSQL databases that use JSON to store data and similarly REST API's response. JSON is also used in storing football event data. It allows easy addition of features in the future.

Though JSON format allows for easier exchange of data, for analysis, a tabular form would be appropriate. A JSON structure can be of two forms: a JSON object and list of JSON objects. Since our programming language of choice is Python, those structures can be somewhat called as a dictionary object or list of dicts.

Importing pandas library

import pandas as pd

1. Flattening a simple JSON

A dict

Let us consider a simple dictionary: 3 keys and their respective values.

viv = {
    "player_id" : 15623, 
    "player_name" : "Vivianne Miedema", 
    "jersey_number" : 11}
viv
{'player_id': 15623, 'player_name': 'Vivianne Miedema', 'jersey_number': 11}

We use the json_normalize API to flatten a JSON dict.

df = pd.json_normalize(viv);df
player_id player_name jersey_number
0 15623 Vivianne Miedema 11
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   player_id      1 non-null      int64 
 1   player_name    1 non-null      object
 2   jersey_number  1 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes

Side Note:If the data contains something that is not compatible with python, in this case a null variable, there are two choices:

  • Change null to None
  • Pass the data through json.loads function

Change null to None

null = None
viv1 = { "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}
viv1
{'player_id': 15623,
 'player_name': 'Vivianne Miedema',
 'jersey_number': 11,
 'player_nickname': None}

Make data as string and pass to json.loads

import json
viv1 = '{ "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null}'
viv1 = json.loads(viv1)
viv1
{'player_id': 15623,
 'player_name': 'Vivianne Miedema',
 'jersey_number': 11,
 'player_nickname': None}

A list of dicts

player_list = [
    { "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
    { "player_id" : 10658, "player_name" : "Danielle van de Donk", "jersey_number" : 7, "player_nickname" : null }
]
pd.json_normalize(player_list)
player_id player_name jersey_number player_nickname
0 15623 Vivianne Miedema 11 None
1 10658 Danielle van de Donk 7 None

We have the JSON list of dicts in a tabular form. All the keys become columns and their values as entries.

When we flattern a list with a key-value pair missing for an entry, instead of an error, NaN(not a number) is stored.

player_list = [
    { "player_id" : 15623, "player_name" : "Vivianne Miedema", "jersey_number" : 11, "player_nickname" : null },
    { "player_id" : 10658, "player_name" : "Danielle van de Donk"}
]
pd.json_normalize(player_list)
player_id player_name jersey_number player_nickname
0 15623 Vivianne Miedema 11.0 NaN
1 10658 Danielle van de Donk NaN NaN

Note: See how player_nickname when not specified also turns to NaN from None.

2. Flattening a multi-level JSON

A simple dict

at_kick0ff = {
  "id":"d712fb93-c464-4621-98ba-f2bdcd5641db",
  "timestamp":"00:00:00.000",
  "duration":0.0,
  "lineup":{
      "player":{
        "id":15623,
        "name":"Vivianne Miedema"
      },
      "position":{
        "id":23,
        "name":"Center Forward"
      },
      "jersey_number":11
    }
}
at_kick0ff
{'id': 'd712fb93-c464-4621-98ba-f2bdcd5641db',
 'timestamp': '00:00:00.000',
 'duration': 0.0,
 'lineup': {'player': {'id': 15623, 'name': 'Vivianne Miedema'},
  'position': {'id': 23, 'name': 'Center Forward'},
  'jersey_number': 11}}
pd.json_normalize(at_kick0ff)
id timestamp duration lineup.player.id lineup.player.name lineup.position.id lineup.position.name lineup.jersey_number
0 d712fb93-c464-4621-98ba-f2bdcd5641db 00:00:00.000 0.0 15623 Vivianne Miedema 23 Center Forward 11

You can see that lineup dictionary key's nested key-value pairs have been expanded into individual columns. If you feel that is unnecessary, we can restrict expansion by using max_level argument. With max_level=1, the flattening goes one level deeper.

pd.json_normalize(at_kick0ff, max_level=1)
id timestamp duration lineup.player lineup.position lineup.jersey_number
0 d712fb93-c464-4621-98ba-f2bdcd5641db 00:00:00.000 0.0 {'id': 15623, 'name': 'Vivianne Miedema'} {'id': 23, 'name': 'Center Forward'} 11

A list of dicts

first_pass = [
  {
    "id":"15758edb-58cd-49c4-a817-d2ef48ba3bcf",
    "timestamp":"00:00:00.504",
    "type":{
      "id":30,
      "name":"Pass"
    },
    "play_pattern":{
      "id":9,
      "name":"From Kick Off"
    },
    "player":{
      "id":15623,
      "name":"Vivianne Miedema"
    },
    "pass":{
      "recipient":{
        "id":10666,
        "name":"Dominique Johanna Anna Bloodworth"
      },
      "length":25.455845,
      "angle":-2.3561945,
      "height":{
        "id":1,
        "name":"Ground Pass"
      },
      "end_location":[
        42.0,
        22.0
      ]
    }
  }, {
  "id" : "ab5674a4-e824-4143-9f6f-3f1645557413",
  "timestamp" : "00:00:04.201",
  "type" : {
    "id" : 30,
    "name" : "Pass"
  },
  "play_pattern" : {
    "id" : 9,
    "name" : "From Kick Off"
  },
  "player" : {
    "id" : 10666,
    "name" : "Dominique Johanna Anna Bloodworth"
  },
  "location" : [ 45.0, 29.0 ],
  "duration" : 1.795201,
  "pass" : {
    "length" : 51.62364,
    "angle" : 0.55038595,
    "height" : {
      "id" : 3,
      "name" : "High Pass"
    },
    "end_location" : [ 89.0, 56.0 ]
  }
}
]
    
pd.json_normalize(first_pass)
id timestamp type.id type.name play_pattern.id play_pattern.name player.id player.name pass.recipient.id pass.recipient.name pass.length pass.angle pass.height.id pass.height.name pass.end_location location duration
0 15758edb-58cd-49c4-a817-d2ef48ba3bcf 00:00:00.504 30 Pass 9 From Kick Off 15623 Vivianne Miedema 10666.0 Dominique Johanna Anna Bloodworth 25.455845 -2.356194 1 Ground Pass [42.0, 22.0] NaN NaN
1 ab5674a4-e824-4143-9f6f-3f1645557413 00:00:04.201 30 Pass 9 From Kick Off 10666 Dominique Johanna Anna Bloodworth NaN NaN 51.623640 0.550386 3 High Pass [89.0, 56.0] [45.0, 29.0] 1.795201

Limiting the levels...

pd.json_normalize(first_pass, max_level=0)
id timestamp type play_pattern player pass location duration
0 15758edb-58cd-49c4-a817-d2ef48ba3bcf 00:00:00.504 {'id': 30, 'name': 'Pass'} {'id': 9, 'name': 'From Kick Off'} {'id': 15623, 'name': 'Vivianne Miedema'} {'recipient': {'id': 10666, 'name': 'Dominique... NaN NaN
1 ab5674a4-e824-4143-9f6f-3f1645557413 00:00:04.201 {'id': 30, 'name': 'Pass'} {'id': 9, 'name': 'From Kick Off'} {'id': 10666, 'name': 'Dominique Johanna Anna ... {'length': 51.62364, 'angle': 0.55038595, 'hei... [45.0, 29.0] 1.795201

3. Flattening a JSON nested list

A simple dict

For this case, let us consider a simpler example than of football event data. The key info has list of dictionaries inside its structure. We call it nested dict.

awfc = {
    'team': 'AWFC',
    'location': 'London',
    'ranking': 1,
    'info': {
        'manager': 'Joe',
        'contacts': {
          'email': {
              'coaching': 'joe@afc.com',
              'general': 'info@afc.com'
          },
          'tel': '123456789',
      }
    },
    'players': [
      { 'name': 'Viv' },
      { 'name': 'DvD' },
      { 'name': 'Kim' }
    ],
};awfc
{'team': 'AWFC',
 'location': 'London',
 'ranking': 1,
 'info': {'manager': 'Joe',
  'contacts': {'email': {'coaching': 'joe@afc.com', 'general': 'info@afc.com'},
   'tel': '123456789'}},
 'players': [{'name': 'Viv'}, {'name': 'DvD'}, {'name': 'Kim'}]}

The players column has a list of dicts. So, we can flatten that column using record_path argument.

pd.json_normalize(awfc, record_path=['players'])
name
0 Viv
1 DvD
2 Kim

But, making a separate table with no reference id has no meaning. To prevent that we can append revelant columns to the new table using meta argument. Here we want their team and Telephone number. The tel key lies within info->contacts->tel. So, we need provide that path like so ['info', 'contacts', 'tel'].

pd.json_normalize(awfc, record_path=['players'], meta=['team',['info', 'contacts', 'tel']])
name team info.contacts.tel
0 Viv AWFC 123456789
1 DvD AWFC 123456789
2 Kim AWFC 123456789

The order in which those paths are mentioned, the order in which those columns are appended.

pd.json_normalize(awfc, record_path=['players'], meta=['team',['info', 'contacts', 'tel'],['info', 'manager']])
name team info.contacts.tel info.manager
0 Viv AWFC 123456789 Joe
1 DvD AWFC 123456789 Joe
2 Kim AWFC 123456789 Joe

A list of dicts

json_list = [
    { 
        'team': 'arsenal', 
        'colour': 'red-white',
        'info': {
            'staff': { 
                'physio': 'xxxx', 
                'doctor': 'yyyy' 
            }
        },
        'players': [
            { 
                'name': 'Viv', 
                'sex': 'F', 
                'stats': { 'goals': 101, 'assists': 40 } 
            },
            { 
                'name': 'Beth', 
                'sex': 'F', 
                'stats': { 'goals': 60, 'assists': 25 } 
            },
        ]
    },
    { 
        'team': 'city', 
        'colour': 'blue',
        'info': {
            'staff': { 
                'physio': 'aaaa', 
                'doctor': 'bbbb' 
            }
        },
        'players': [
            { 'name': 'Steph', 'sex': 'F' },
            { 'name': 'Lucy', 'sex': 'F' },
        ]
    },
]

pd.json_normalize(json_list)
team colour players info.staff.physio info.staff.doctor
0 arsenal red-white [{'name': 'Viv', 'sex': 'F', 'stats': {'goals'... xxxx yyyy
1 city blue [{'name': 'Steph', 'sex': 'F'}, {'name': 'Lucy... aaaa bbbb
pd.json_normalize(json_list, record_path =['players'])
name sex stats.goals stats.assists
0 Viv F 101.0 40.0
1 Beth F 60.0 25.0
2 Steph F NaN NaN
3 Lucy F NaN NaN

How about we now append the players' team, colour, and their physio.

pd.json_normalize(
    json_list, 
    record_path =['players'], 
    meta=['team', 'colour', ['info', 'staff', 'physio']]
)
name sex stats.goals stats.assists team colour info.staff.physio
0 Viv F 101.0 40.0 arsenal red-white xxxx
1 Beth F 60.0 25.0 arsenal red-white xxxx
2 Steph F NaN NaN city blue aaaa
3 Lucy F NaN NaN city blue aaaa

4. Ignoring key errors

json_list = [
    { 
        'team': 'arsenal', 
        'colour': 'red-white',
        'info': {
            'staff': { 
                'physio': 'xxxx', 
                'doctor': 'yyyy' 
            }
        },
        'players': [
            { 
                'name': 'Viv', 
                'sex': 'F', 
                'stats': { 'goals': 101, 'assists': 40 } 
            },
            { 
                'name': 'Beth', 
                'sex': 'F', 
                'stats': { 'goals': 60, 'assists': 25 } 
            },
        ]
    },
    { 
        'team': 'city', 
        'colour': 'blue',
        'info': {
            'staff': { 
                'doctor': 'bbbb' 
            }
        },
        'players': [
            { 'name': 'Steph', 'sex': 'F' },
            { 'name': 'Lucy', 'sex': 'F' },
        ]
    },
]

Notice that the key physio is missing from the entry team=city. What happens if we try to access physio key inside meta?

pd.json_normalize(
    json_list, 
    record_path =['players'], 
    meta=['team', 'colour', ['info', 'staff', 'physio']],
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/homebrew/Caskroom/miniforge/base/envs/jupnet/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    491                         try:
--> 492                             meta_val = _pull_field(obj, val[level:])
    493                         except KeyError as e:

/opt/homebrew/Caskroom/miniforge/base/envs/jupnet/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
    387             for field in spec:
--> 388                 result = result[field]
    389         else:

KeyError: 'physio'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/var/folders/27/bh1fc62j2yq48m_4vk_tgq_80000gp/T/ipykernel_8402/2719010332.py in <module>
----> 1 pd.json_normalize(
      2     json_list,
      3     record_path =['players'],
      4     meta=['team', 'colour', ['info', 'staff', 'physio']],
      5 )

/opt/homebrew/Caskroom/miniforge/base/envs/jupnet/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    502                 records.extend(recs)
    503 
--> 504     _recursive_extract(data, record_path, {}, level=0)
    505 
    506     result = DataFrame(records)

/opt/homebrew/Caskroom/miniforge/base/envs/jupnet/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    495                                 meta_val = np.nan
    496                             else:
--> 497                                 raise KeyError(
    498                                     "Try running with errors='ignore' as key "
    499                                     f"{e} is not always present"

KeyError: "Try running with errors='ignore' as key 'physio' is not always present"

How come stats.goals and stats.assists didn't generate an error but that above does? Because, the meta argument expects values to be present for listed keys in meta by default. We can ignore those errors(as suggested) using errors='ignore'

pd.json_normalize(
    json_list, 
    record_path =['players'], 
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    errors='ignore'
)
name sex stats.goals stats.assists team colour info.staff.physio
0 Viv F 101.0 40.0 arsenal red-white xxxx
1 Beth F 60.0 25.0 arsenal red-white xxxx
2 Steph F NaN NaN city blue NaN
3 Lucy F NaN NaN city blue NaN

5. Custom separator sep

We notice that by default pandas uses . to indicate the direction of the path. We can change that using the sep argument.

Tip:Usually an underscore is used instead of .

json_list = [
    { 
        'team': 'arsenal', 
        'colour': 'red-white',
        'info': {
            'staff': { 
                'physio': 'xxxx', 
                'doctor': 'yyyy' 
            }
        },
        'players': [
            { 
                'name': 'Viv', 
                'sex': 'F', 
                'stats': { 'goals': 101, 'assists': 40 } 
            },
            { 
                'name': 'Beth', 
                'sex': 'F', 
                'stats': { 'goals': 60, 'assists': 25 } 
            },
        ]
    },
    { 
        'team': 'city', 
        'colour': 'blue',
        'info': {
            'staff': { 
                'physio': 'aaaa', 
                'doctor': 'bbbb' 
            }
        },
        'players': [
            { 'name': 'Steph', 'sex': 'F' },
            { 'name': 'Lucy', 'sex': 'F' },
        ]
    },
]
pd.json_normalize(
    json_list, 
    record_path =['players'], 
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    sep='->'
)
name sex stats->goals stats->assists team colour info->staff->physio
0 Viv F 101.0 40.0 arsenal red-white xxxx
1 Beth F 60.0 25.0 arsenal red-white xxxx
2 Steph F NaN NaN city blue aaaa
3 Lucy F NaN NaN city blue aaaa

6. Adding context to record and meta data using record_prefix and meta_prefix

pd.json_normalize(
    json_list, 
    record_path=['players'], 
    meta=['team', 'colour', ['info', 'staff', 'physio']],
    meta_prefix='meta-',
    record_prefix='player-',
    sep='->'
)
player-name player-sex player-stats->goals player-stats->assists meta-team meta-colour meta-info->staff->physio
0 Viv F 101.0 40.0 arsenal red-white xxxx
1 Beth F 60.0 25.0 arsenal red-white xxxx
2 Steph F NaN NaN city blue aaaa
3 Lucy F NaN NaN city blue aaaa

7. Working with a local file

In most scenarios, we won't be making new JSON object ourselves instead use JSON formatted files. We make use python's json module and read the file, then use pandas' json_normalize to flatten it into a dataframe.

import json
# load data using Python JSON module
with open('my_attachment/movies.json') as f:
    data = json.load(f)
    
# Normalizing data
pd.json_normalize(data)
Title US Gross Worldwide Gross US DVD Sales Production Budget Release Date MPAA Rating Running Time min Distributor Source Major Genre Creative Type Director Rotten Tomatoes Rating IMDB Rating IMDB Votes
0 The Land Girls 146083 146083 NaN 8000000 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876 10876 NaN 300000 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134 203134 NaN 250000 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Four Rooms 4301000 4301000 NaN 4000000 Dec 25 1995 R NaN Miramax Original Screenplay Comedy Contemporary Fiction Robert Rodriguez 14.0 6.4 34328.0
4 The Four Seasons 42488161 42488161 NaN 6500000 May 22 1981 None NaN Universal Original Screenplay Comedy Contemporary Fiction Alan Alda 71.0 7.0 1814.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 Big Things 0 0 NaN 50000 Dec 31 2009 None NaN None None None None None NaN NaN NaN
64 Bogus 4357406 4357406 NaN 32000000 Sep 06 1996 PG NaN Warner Bros. Original Screenplay Comedy Fantasy Norman Jewison 40.0 4.8 2742.0
65 Beverly Hills Cop 234760478 316300000 NaN 15000000 Dec 05 1984 None NaN Paramount Pictures Original Screenplay Action Contemporary Fiction Martin Brest 83.0 7.3 45065.0
66 Beverly Hills Cop II 153665036 276665036 NaN 20000000 May 20 1987 R NaN Paramount Pictures Original Screenplay Action Contemporary Fiction Tony Scott 46.0 6.1 29712.0
67 Beverly Hills Cop III 42586861 119180938 NaN 50000000 May 25 1994 R NaN Paramount Pictures Original Screenplay Action Contemporary Fiction John Landis 10.0 5.0 21199.0

68 rows × 16 columns

8. Working with URL

Reading a JSON file from an url needs an extra module in requests as any data from the Internet carries overheads that are necessary for efficient exchange of information(REST API). So, in order to read the file contents, we call upon requests' text attribute which fetches the contents of the file.

Here, we use json.loads and not json.load as loads function expects contents(string) rather than a file pointer. If looked closely into the json module, the load calls loads using read() on the file.

import requests

URL = 'https://vega.github.io/vega-datasets/data/cars.json'

data = json.loads(requests.get(URL).text)
pd.json_normalize(data)
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA
... ... ... ... ... ... ... ... ... ...
401 ford mustang gl 27.0 4 140.0 86.0 2790 15.6 1982-01-01 USA
402 vw pickup 44.0 4 97.0 52.0 2130 24.6 1982-01-01 Europe
403 dodge rampage 32.0 4 135.0 84.0 2295 11.6 1982-01-01 USA
404 ford ranger 28.0 4 120.0 79.0 2625 18.6 1982-01-01 USA
405 chevy s-10 31.0 4 119.0 82.0 2720 19.4 1982-01-01 USA

406 rows × 9 columns

Conclusion

We saw the use of json_normalize function in pandas library. It helps take a JSON data, flatten it, and make it as a dataframe for easier analysis.

References

  1. B. Chen, https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd
  2. Pandas documentation, https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html