An early case for understanding and standardizing best practice for Interoperable Avatars was addressed in the definition of the ISO standard for Humanoid Animation (HAnim):
https://www.web3d.org/documents/specifications/19774-1/V2.0/index.html
For the HAnim concept, the prime motivation is the ability to transport animations and interactions between similar Humanoid characters. The first general requirement identified was a common coordinate space: the Humanoid operates in its own local coordinate system, while the character must function in the global, or parent, coordinate system of its environment.
For best results, these spaces should be identical, so the Humanoid is dimensioned at human scale, 1:1, with all measurements referenced to 0 0 0 at the floor between the feet, facing +Z, with +X to its left and +Y up. Next, the skeleton is defined using the prior-to-animation default pose, described as an I-pose: relaxed attention. In this pose all joint rotations are 0 0 1 0, meaning that if you were positioned on the transform you would be facing +Z and, using the right-hand rule, would control joint orientation using pitch (X-axis), yaw (Y-axis), and roll (Z-axis) rotations. Using this as the initial, default pose simplifies transport of animations because the user knows the values of the initial transforms. Once the skeleton is initialized to this state, it can be moved to any convenient pose for other operations; it simply must be able to return to this basic pose when set to the default pose.
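To make the convention concrete, here is a minimal Python sketch (illustrative only, not part of the HAnim specification) showing that the default joint rotation 0 0 1 0 (axis 0 0 1, angle 0) is the identity, and how pitch/yaw/roll map to right-hand-rule rotations about X, Y, and Z:

```python
import math

def axis_angle_to_matrix(x, y, z, angle):
    """Convert an axis-angle rotation (unit axis + angle in radians, as in
    HAnim/X3D rotation fields) to a 3x3 row-major rotation matrix via
    Rodrigues' formula (right-hand rule)."""
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [t*x*x + c,   t*x*y - s*z, t*x*z + s*y],
        [t*x*y + s*z, t*y*y + c,   t*y*z - s*x],
        [t*x*z - s*y, t*y*z + s*x, t*z*z + c  ],
    ]

# The HAnim default pose sets every Joint rotation to "0 0 1 0":
# axis (0, 0, 1), angle 0 -- the identity rotation.
identity = axis_angle_to_matrix(0, 0, 1, 0)

# Pitch, yaw, and roll are rotations about +X, +Y, and +Z respectively,
# with the character facing +Z and +Y up.
pitch_90 = axis_angle_to_matrix(1, 0, 0, math.pi / 2)  # rotates +Y toward +Z
```

Knowing that every joint starts at the identity is what makes an animation authored on one conforming skeleton replayable on another.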
HAnim defines 5 levels of articulation (LOA), for uses ranging from a simple human-oriented database to a realistically complete human model capable of fully realistic simulations. HAnim also includes a set of surface feature landmarks, called Sites, to provide a standardized set of interaction and data points.
HAnim defines two levels of complexity for the geometries used in the model. Level 1 is called Segment geometry because it represents a model whose surface is composed of geometries directly associated with connections between Joint objects or to end-effectors. When a Joint is animated, child geometries and skeleton structures move in direct response to the parent joint rotations. Level 2 can also use a deformable mesh skin: each vertex of the 'skin' geometry is bound to one or more Joints and is moved according to a weighted value derived from the rotations of the controlling Joint objects. In addition, individual points or sets of points of any geometry can be animated using a scalar-driven Displacer technique.
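The Level 2 weighted binding is, in essence, linear blend skinning. A minimal Python sketch (illustrative, with hypothetical joint transforms rather than an actual HAnim skeleton):

```python
def skin_vertex(rest_pos, bindings):
    """Linear blend skinning: the deformed vertex is the weighted sum of the
    rest-pose vertex transformed by each controlling joint.

    rest_pos : (x, y, z) vertex position in the default (I-pose) skin.
    bindings : list of (weight, matrix) pairs; matrix is a 3x4 row-major
               transform (rotation + translation) mapping rest-pose
               coordinates to the current pose. Weights should sum to 1.0.
    """
    x, y, z = rest_pos
    out = [0.0, 0.0, 0.0]
    for weight, m in bindings:
        for i in range(3):
            out[i] += weight * (m[i][0]*x + m[i][1]*y + m[i][2]*z + m[i][3])
    return tuple(out)

# Hypothetical example: a vertex influenced 70/30 by two joints. Joint A is
# the identity (no motion); joint B translates +1 along X. The blended
# vertex therefore moves 0.3 along X.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
SHIFT_X  = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
p = skin_vertex((0.0, 0.0, 0.0), [(0.7, IDENTITY), (0.3, SHIFT_X)])
# p == (0.3, 0.0, 0.0)
```

This is why preserving the joint-to-vertex weight sets matters when a skin mesh is swapped: the deformation is defined entirely by those bindings.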
I hope this simple summary gives you the idea that a lot of work has been done and should be leveraged to the maximum for this current Interoperable concept. As I read the latest 3GPP specification, I think I see that concepts such as default scale, default coordinate space, and default pose are yet to be defined. There is an example of a candidate skeleton structure (close to HAnim LOA2), three examples of continuous-mesh skin at different point densities, and the statement that there is a fundamental set of points that remains the same as the point density decreases. This is important in transporting joint-to-skin-vertex bindings when the skin is changed. Also, work has been done to define landmarks for detailed facial animation. This is also a project in the current X3D HAnim Working Group, coordinating with MPEG.
Finally, this effort is aimed at realtime interactions with other realtime simulations, ranging from a relatively simple database to a fully competent virtual twin, not a character used to make a video, so certain performance issues should be addressed.
https://www.web3d.org/working-groups/hanim
Thank You All and Best Regards,
Joe