Thursday, December 10, 2020

Pandas change string to new dataframe

 Today I was given a challenge where someone was extracting JSON data and wanting to write it to a database and thanks to https://www.dataquest.io/blog/sql-insert-tutorial/ this gave the relevant information to enable the JSON data to be placed into a Pandas DataFrame and then grab the column names and the data and generate the necessary SQL.

HOWEVER

The data that was given had an array of dictionaries within it causing the DataFrame to show it as a string!

So this information was dropped from the DataFrame so that the core data for that table could be stored.

A second DataFrame was then required so the the Security Group data for each instance could then be stored into the database in its own table.

For this to happen I first created a new DataFrame that would contain;

  • Instance ID
  • Security Group array of dictionaries as a string
From here it was then a case of breaking it down so that the string and the instance ID could become a new record for a new DataFrame.


data = pd.read_json('ec2_data.json')
sg=data[['instance_id','sec_groups']]
a=[]
for x in sg.values:
    for y in x[1]:
        b=dict()
         b={"instance_id": x[0]}
         b.update(y)
         a.append(b)
df2=pd.DataFrame.from_dict(a)
print(df2)

  • The idea was to create the new DataFrame called sg from the original, but only taking the 2 columns we needed.
  • An empty array was then created so that the new DataFrame (df2) could be created since each record in a DataFrame is a dictionary from an array element.
  • We then iterate over the values of the sg DataFrame and within that iterate over the security groups since x[1] contains an array of security group dictionary objects.
  • A temporary dictionary is then created (b) to store the data of instance id, security group id and security group name.
  • That dictionary is then added to the array a.
  • This continues for all the data, and if there happens to be an instance with more than 1 security group then the instance ID will appear twice in the final DataFrame, but with the different security group data for each record.