# Coauthor Network Analysis

This notebook provides comprehensive descriptive statistics and visualizations of the coauthor collaboration network.

## Overview
- **Network Structure**: Nodes represent authors, edges represent coauthorship relationships
- **Data Source**: Generated from Google Scholar via a SerpAPI crawler
- **Depth**: 2-hop network from seed author (Christopher Neilson)


In [None]:
# Import necessary libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import networkx as nx
from typing import Dict, List, Any

# Set style for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")


## 1. Load Data


In [None]:
# Load the coauthor network visualization data
with open('_data/coauthor_network_viz.json', 'r', encoding='utf-8') as f:
    network_data = json.load(f)

# Load the full graph data for more detailed analysis
with open('serpapi_coauthor_graph.json', 'r', encoding='utf-8') as f:
    graph_data = json.load(f)

print(f"Network Visualization Data:")
print(f"  - Nodes: {network_data['node_count']}")
print(f"  - Edges: {network_data['edge_count']}")
print(f"  - Max Depth: {network_data['max_depth']}")
print(f"  - Seed Author: {network_data['seed_author_id']}")
print(f"\nFull Graph Data:")
print(f"  - Authors: {len(graph_data['authors'])}")
print(f"  - Edges: {len(graph_data['edges'])}")
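
The cells in this notebook read a fixed set of keys from the visualization file, so a missing key only surfaces later as a `KeyError`. A minimal sketch of a pre-flight check follows, assuming the key names used elsewhere in this notebook. The helper `missing_keys` and the toy payload are hypothetical and are not part of the crawler output.

```python
# Keys this notebook reads from the visualization file (collected from its cells).
TOP_LEVEL_KEYS = {
    "node_count", "edge_count", "max_depth",
    "seed_author_id", "generated_at", "nodes", "edges",
}
NODE_KEYS = {"id", "name", "min_depth"}
EDGE_KEYS = {"source", "target"}


def missing_keys(data):
    """Return readable problems; an empty list means the shape looks usable."""
    problems = [f"top-level key missing: {k}" for k in sorted(TOP_LEVEL_KEYS - set(data))]
    for i, node in enumerate(data.get("nodes", [])):
        for k in sorted(NODE_KEYS - set(node)):
            problems.append(f"nodes[{i}] missing: {k}")
    for i, edge in enumerate(data.get("edges", [])):
        for k in sorted(EDGE_KEYS - set(edge)):
            problems.append(f"edges[{i}] missing: {k}")
    return problems


# Hypothetical toy payload, for illustration only.
toy = {
    "node_count": 2, "edge_count": 1, "max_depth": 2,
    "seed_author_id": "a", "generated_at": "unknown",
    "nodes": [{"id": "a", "name": "A", "min_depth": 0}, {"id": "b", "name": "B"}],
    "edges": [{"source": "a", "target": "b"}],
}
print(missing_keys(toy))
```

Calling `missing_keys(network_data)` right after loading would report every gap at once rather than one failure per cell.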


## 2. Data Preparation


In [None]:
# Create DataFrames for easier analysis
nodes_df = pd.DataFrame(network_data['nodes'])
edges_df = pd.DataFrame(network_data['edges'])

# Add computed columns
nodes_df['has_citations'] = nodes_df['citation_total'].notna()
nodes_df['has_website'] = nodes_df['website'].notna() & (nodes_df['website'] != '')
nodes_df['affiliation_count'] = nodes_df['affiliations'].apply(
    lambda x: len(x.split(',')) if isinstance(x, str) and x else 0
)

print("Node DataFrame:")
print(nodes_df.head())
print(f"\nShape: {nodes_df.shape}")
print(f"\nEdge DataFrame:")
print(edges_df.head())
print(f"\nShape: {edges_df.shape}")


## 3. Build NetworkX Graph


In [None]:
# Build NetworkX graph for network metrics
G = nx.Graph()

# Add nodes with attributes
for node in network_data['nodes']:
    G.add_node(
        node['id'],
        name=node['name'],
        depth=node['min_depth'],
        citations=node.get('citation_total', 0),
        is_seed=node.get('is_seed', False)
    )

# Add edges
for edge in network_data['edges']:
    G.add_edge(
        edge['source'],
        edge['target'],
        weight=edge.get('num_coauthored_papers', 1)
    )

print(f"Graph created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
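
`G.add_edge` creates any endpoint that is not yet in the graph, so an edge that references an unknown author id would add a node with no `name` or `depth` attribute, and later name lookups against `nodes_df` would fail for it. The sketch below shows one way to detect such edges before building the graph. The function name `find_dangling_edges` and the toy records are hypothetical.

```python
def find_dangling_edges(nodes, edges):
    """Return the edges whose source or target id is absent from the node records."""
    known_ids = {node["id"] for node in nodes}
    return [
        edge for edge in edges
        if edge["source"] not in known_ids or edge["target"] not in known_ids
    ]


# Hypothetical toy records: author "c" appears in an edge but not in the node list.
toy_nodes = [{"id": "a"}, {"id": "b"}]
toy_edges = [{"source": "a", "target": "b"}, {"source": "b", "target": "c"}]

dangling = find_dangling_edges(toy_nodes, toy_edges)
print(dangling)
```

On the real data this would be called as `find_dangling_edges(network_data['nodes'], network_data['edges'])`; an empty result means node and edge counts can be compared directly with the file's metadata.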


## 4. Network-Level Statistics


In [None]:
print("=" * 60)
print("NETWORK-LEVEL STATISTICS")
print("=" * 60)
print(f"\nBasic Network Properties:")
print(f"  - Number of nodes: {G.number_of_nodes()}")
print(f"  - Number of edges: {G.number_of_edges()}")
print(f"  - Network density: {nx.density(G):.4f}")
print(f"  - Is connected: {nx.is_connected(G)}")

if not nx.is_connected(G):
    components = list(nx.connected_components(G))
    print(f"  - Number of connected components: {nx.number_connected_components(G)}")
    print(f"  - Largest component size: {len(max(components, key=len))}")
    print(f"  - Smallest component size: {len(min(components, key=len))}")
    # Use largest component for some metrics
    G_main = G.subgraph(max(components, key=len)).copy()
else:
    G_main = G

print(f"\nClustering and Path Metrics (on largest component):")
print(f"  - Average clustering coefficient: {nx.average_clustering(G_main):.4f}")
print(f"  - Average shortest path length: {nx.average_shortest_path_length(G_main):.4f}")
print(f"  - Network diameter: {nx.diameter(G_main)}")
print(f"  - Network radius: {nx.radius(G_main)}")

# Degree statistics
degrees = [d for n, d in G.degree()]
print(f"\nDegree Statistics:")
print(f"  - Average degree: {np.mean(degrees):.2f}")
print(f"  - Median degree: {np.median(degrees):.2f}")
print(f"  - Min degree: {np.min(degrees)}")
print(f"  - Max degree: {np.max(degrees)}")
print(f"  - Std deviation: {np.std(degrees):.2f}")
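
For an undirected simple graph with n nodes and m edges, density is 2m / (n(n - 1)) and the mean degree is 2m / n. The dependency-free sketch below recomputes both from an edge list as a cross-check on the values printed above. The function name and the toy graph are illustrative assumptions.

```python
def density_and_mean_degree(num_nodes, edge_list):
    """Compute density and mean degree of an undirected simple graph."""
    # Collapse duplicate edges and drop self-loops so the simple-graph formulas apply.
    unique_edges = {frozenset(edge) for edge in edge_list if edge[0] != edge[1]}
    m = len(unique_edges)
    n = num_nodes
    density = 0.0 if n < 2 else 2 * m / (n * (n - 1))
    mean_degree = 0.0 if n == 0 else 2 * m / n
    return density, mean_degree


# Toy graph: a triangle a-b-c plus one isolated node, so n = 4 and m = 3.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "a")]  # last edge is a duplicate
density, mean_degree = density_and_mean_degree(4, toy_edges)
print(density, mean_degree)
```

Because the isolated node counts toward n, it lowers both values; the same happens in the notebook when authors without edges are present.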


## 5. Author-Level Statistics


In [None]:
print("=" * 60)
print("AUTHOR-LEVEL STATISTICS")
print("=" * 60)

# Citation statistics
citations_df = nodes_df[nodes_df['citation_total'].notna()]
print(f"\nCitation Statistics (n={len(citations_df)} authors with data):")
print(f"  - Total citations: {citations_df['citation_total'].sum():,.0f}")
print(f"  - Average citations: {citations_df['citation_total'].mean():,.0f}")
print(f"  - Median citations: {citations_df['citation_total'].median():,.0f}")
print(f"  - Min citations: {citations_df['citation_total'].min():,.0f}")
print(f"  - Max citations: {citations_df['citation_total'].max():,.0f}")
print(f"  - Std deviation: {citations_df['citation_total'].std():,.0f}")

# Depth distribution
print(f"\nDepth Distribution:")
depth_counts = nodes_df['min_depth'].value_counts().sort_index()
for depth, count in depth_counts.items():
    pct = (count / len(nodes_df)) * 100
    print(f"  - Depth {depth}: {count} authors ({pct:.1f}%)")

# Coauthor count statistics
print(f"\nCoauthor Count Statistics:")
print(f"  - Average coauthors per author: {nodes_df['coauthor_count'].mean():.2f}")
print(f"  - Median coauthors: {nodes_df['coauthor_count'].median():.0f}")
print(f"  - Max coauthors: {nodes_df['coauthor_count'].max():.0f}")

# Article count statistics
print(f"\nArticle Count Statistics:")
print(f"  - Average articles per author: {nodes_df['article_count'].mean():.2f}")
print(f"  - Median articles: {nodes_df['article_count'].median():.0f}")
print(f"  - Max articles: {nodes_df['article_count'].max():.0f}")

# Metadata completeness
print(f"\nMetadata Completeness:")
print(f"  - Authors with citations: {nodes_df['has_citations'].sum()} ({nodes_df['has_citations'].mean()*100:.1f}%)")
print(f"  - Authors with websites: {nodes_df['has_website'].sum()} ({nodes_df['has_website'].mean()*100:.1f}%)")
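# Treat missing (NaN) and blank affiliations the same way; a bare `!= ''` counts NaN as present
has_affiliation = nodes_df['affiliations'].fillna('').astype(str).str.strip() != ''
print(f"  - Authors with affiliations: {has_affiliation.sum()} ({has_affiliation.mean()*100:.1f}%)")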


## 6. Top Authors by Various Metrics


In [None]:
print("=" * 60)
print("TOP AUTHORS")
print("=" * 60)

# Top 10 by citations
print("\nTop 10 Authors by Citations:")
top_citations = nodes_df.nlargest(10, 'citation_total')[['name', 'citation_total', 'affiliations', 'min_depth']]
for idx, row in top_citations.iterrows():
    print(f"  {row['name']}: {row['citation_total']:,.0f} citations (depth {row['min_depth']})")
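    # NaN is truthy, so check the type before slicing the string
    if isinstance(row['affiliations'], str) and row['affiliations']: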
        print(f"    → {row['affiliations'][:80]}{'...' if len(row['affiliations']) > 80 else ''}")

# Top 10 by coauthor count
print("\nTop 10 Authors by Coauthor Count:")
top_coauthors = nodes_df.nlargest(10, 'coauthor_count')[['name', 'coauthor_count', 'min_depth']]
for idx, row in top_coauthors.iterrows():
    print(f"  {row['name']}: {row['coauthor_count']} coauthors (depth {row['min_depth']})")

# Top 10 by article count
print("\nTop 10 Authors by Article Count:")
top_articles = nodes_df.nlargest(10, 'article_count')[['name', 'article_count', 'min_depth']]
for idx, row in top_articles.iterrows():
    print(f"  {row['name']}: {row['article_count']} articles (depth {row['min_depth']})")

# Most central authors (by degree centrality)
degree_centrality = nx.degree_centrality(G)
top_central = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 Authors by Degree Centrality (most connected):")
for author_id, centrality in top_central:
    author_name = nodes_df[nodes_df['id'] == author_id]['name'].values[0]
    print(f"  {author_name}: {centrality:.4f}")

# Betweenness centrality (bridge authors)
betweenness = nx.betweenness_centrality(G)
top_between = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:10]
print("\nTop 10 Authors by Betweenness Centrality (bridge authors):")
for author_id, centrality in top_between:
    author_name = nodes_df[nodes_df['id'] == author_id]['name'].values[0]
    print(f"  {author_name}: {centrality:.4f}")
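
Degree centrality is a node's degree divided by n - 1, the largest degree it could have in a simple graph with n nodes. The sketch below works this out by hand on a toy star graph, as a way to read the values printed above. It is a plain-Python illustration, not the NetworkX implementation, and the function name is an assumption.

```python
from collections import defaultdict


def degree_centrality_by_hand(nodes, edge_list):
    """Return degree / (n - 1) for every node of an undirected simple graph."""
    neighbors = defaultdict(set)
    for u, v in edge_list:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    scale = len(nodes) - 1
    return {
        node: (len(neighbors[node]) / scale if scale > 0 else 0.0)
        for node in nodes
    }


# Toy star graph: "s" is connected to three leaves.
toy_nodes = ["s", "a", "b", "c"]
toy_edges = [("s", "a"), ("s", "b"), ("s", "c")]
centrality = degree_centrality_by_hand(toy_nodes, toy_edges)
print(centrality)
```

The hub reaches the maximum of 1.0 and each leaf scores 1/3. Betweenness centrality answers a different question (how often a node lies on shortest paths between others), which is why the two rankings above can differ.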


## 7. Institutional Analysis


In [None]:
print("=" * 60)
print("INSTITUTIONAL ANALYSIS")
print("=" * 60)

# Extract institutions from affiliations
def extract_institutions(affiliation_str):
    """Extract institution names from affiliation string."""
    if not isinstance(affiliation_str, str) or not affiliation_str:
        return []
    # Split by common separators and clean
    parts = affiliation_str.replace(';', ',').split(',')
    return [p.strip() for p in parts if p.strip()]

# Count institutions
institution_counter = Counter()
for affiliations in nodes_df['affiliations']:
    institutions = extract_institutions(affiliations)
    institution_counter.update(institutions)

print("\nTop 20 Institutions by Author Count:")
for institution, count in institution_counter.most_common(20):
    pct = (count / len(nodes_df)) * 100
    print(f"  {institution}: {count} authors ({pct:.1f}%)")

# Group institutions by common keywords
university_counts = defaultdict(list)
for inst, count in institution_counter.items():
    inst_lower = inst.lower()
    if 'yale' in inst_lower:
        university_counts['Yale'].append(count)
    elif 'princeton' in inst_lower:
        university_counts['Princeton'].append(count)
    elif 'andes' in inst_lower:
        university_counts['Andes'].append(count)
    elif 'stanford' in inst_lower:
        university_counts['Stanford'].append(count)
    elif 'harvard' in inst_lower:
        university_counts['Harvard'].append(count)
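    # Match 'mit' only as a whole token; a substring test also matches e.g. 'committee'
    elif 'mit' in inst_lower.replace('(', ' ').replace(')', ' ').split() or 'massachusetts institute' in inst_lower: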
        university_counts['MIT'].append(count)
    elif 'berkeley' in inst_lower:
        university_counts['Berkeley'].append(count)
    elif 'chicago' in inst_lower:
        university_counts['Chicago'].append(count)

print("\nKey University Groups:")
for uni, counts in sorted(university_counts.items(), key=lambda x: sum(x[1]), reverse=True):
    print(f"  {uni}: {sum(counts)} authors")
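
Substring tests on short keywords are easy to over-match: for example, `'mit'` is a substring of "committee". The sketch below shows a stricter grouping based on word-boundary regular expressions. The pattern table is a hypothetical, partial example and would need an entry for every university group of interest.

```python
import re

# Hypothetical keyword table; only two groups are shown for illustration.
UNIVERSITY_PATTERNS = {
    "MIT": re.compile(r"\bmit\b|massachusetts institute", re.IGNORECASE),
    "Yale": re.compile(r"\byale\b", re.IGNORECASE),
}


def match_university(institution):
    """Return the first matching group label, or None if nothing matches."""
    for label, pattern in UNIVERSITY_PATTERNS.items():
        if pattern.search(institution):
            return label
    return None


examples = [
    "MIT Sloan",
    "Budget Committee",
    "Yale University",
    "Massachusetts Institute of Technology",
]
matches = [match_university(text) for text in examples]
print(matches)
```

Note also that the grouped totals sum counts of affiliation fragments, so an author whose affiliation string mentions the same university twice is counted twice.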


## 8. Collaboration Edge Analysis


In [None]:
print("=" * 60)
print("COLLABORATION EDGE ANALYSIS")
print("=" * 60)

# Papers per collaboration
papers_per_collab = edges_df['num_coauthored_papers']
print(f"\nCoauthored Papers per Edge:")
print(f"  - Average: {papers_per_collab.mean():.2f}")
print(f"  - Median: {papers_per_collab.median():.0f}")
print(f"  - Min: {papers_per_collab.min():.0f}")
print(f"  - Max: {papers_per_collab.max():.0f}")
print(f"  - Std deviation: {papers_per_collab.std():.2f}")

# Distribution of collaboration strength
print(f"\nCollaboration Strength Distribution:")
# Right-closed bins: (0, 1], (1, 2], (2, 3], (3, 5], (5, 10], (10, inf)
strength_bins = [0, 1, 2, 3, 5, 10, float('inf')]
strength_labels = ['1 paper', '2 papers', '3 papers', '4-5 papers', '6-10 papers', '11+ papers']
strength_dist = pd.cut(papers_per_collab, bins=strength_bins, labels=strength_labels, right=True)
for label in strength_labels:
    count = (strength_dist == label).sum()
    pct = (count / len(edges_df)) * 100
    print(f"  {label}: {count} collaborations ({pct:.1f}%)")

# Strongest collaborations
print("\nTop 15 Strongest Collaborations (by number of papers):")
top_collabs = edges_df.nlargest(15, 'num_coauthored_papers')
for idx, edge in top_collabs.iterrows():
    source_name = nodes_df[nodes_df['id'] == edge['source']]['name'].values[0]
    target_name = nodes_df[nodes_df['id'] == edge['target']]['name'].values[0]
    print(f"  {source_name} ↔ {target_name}: {edge['num_coauthored_papers']} papers")
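
Interval bins are easy to get off by one, so it helps to check each boundary value against its label. The sketch below maps a paper count to a label using right-closed intervals (0, 1], (1, 2], (2, 3], (3, 5], (5, 10], (10, inf), which is the convention `pd.cut` uses by default (`right=True`). The label names are this sketch's own choice, and integer counts of at least 1 are assumed.

```python
import bisect

# Upper bounds of the right-closed intervals; anything above the last bound
# falls into the final open-ended bin.
UPPER_BOUNDS = [1, 2, 3, 5, 10]
LABELS = ["1 paper", "2 papers", "3 papers", "4-5 papers", "6-10 papers", "11+ papers"]


def strength_label(num_papers):
    """Map a coauthored-paper count (>= 1) to its collaboration-strength label."""
    if num_papers < 1:
        raise ValueError("an edge should represent at least one coauthored paper")
    # bisect_left returns the index of the first bound >= num_papers.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, num_papers)]


for count in (1, 2, 3, 4, 5, 6, 10, 11):
    print(count, "->", strength_label(count))
```

Checking the boundary values 1, 5, 10 and 11 is usually enough to confirm that labels and intervals line up.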


## 9. Visualizations

### 9.1 Degree Distribution


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Histogram of degrees
degrees = [d for n, d in G.degree()]
ax1.hist(degrees, bins=30, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Degree (Number of Coauthors)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Coauthor Network Degree', fontsize=14, fontweight='bold')
ax1.axvline(np.mean(degrees), color='red', linestyle='--', label=f'Mean: {np.mean(degrees):.1f}')
ax1.axvline(np.median(degrees), color='green', linestyle='--', label=f'Median: {np.median(degrees):.1f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Log-log plot for power law check
degree_counts = Counter(degrees)
degrees_sorted = sorted(degree_counts.keys())
counts_sorted = [degree_counts[d] for d in degrees_sorted]
ax2.loglog(degrees_sorted, counts_sorted, 'bo-', alpha=0.6)
ax2.set_xlabel('Degree (log scale)', fontsize=12)
ax2.set_ylabel('Frequency (log scale)', fontsize=12)
ax2.set_title('Degree Distribution (Log-Log)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, which='both')

plt.tight_layout()
plt.savefig('coauthor_degree_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_degree_distribution.png")


### 9.2 Citation Distribution


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Citation histogram
citations = nodes_df['citation_total'].dropna()
ax1.hist(citations, bins=50, edgecolor='black', alpha=0.7, color='coral')
ax1.set_xlabel('Citation Count', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Citation Counts', fontsize=14, fontweight='bold')
ax1.axvline(citations.mean(), color='red', linestyle='--', label=f'Mean: {citations.mean():,.0f}')
ax1.axvline(citations.median(), color='green', linestyle='--', label=f'Median: {citations.median():,.0f}')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Log scale version
ax2.hist(np.log10(citations + 1), bins=50, edgecolor='black', alpha=0.7, color='lightblue')
ax2.set_xlabel('Log10(Citation Count + 1)', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title('Distribution of Citation Counts (Log Scale)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('coauthor_citation_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_citation_distribution.png")


### 9.3 Network Depth Analysis


In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Depth distribution
depth_counts = nodes_df['min_depth'].value_counts().sort_index()
ax1.bar(depth_counts.index, depth_counts.values, edgecolor='black', alpha=0.7, color='steelblue')
ax1.set_xlabel('Network Depth', fontsize=12)
ax1.set_ylabel('Number of Authors', fontsize=12)
ax1.set_title('Authors by Network Depth', fontsize=14, fontweight='bold')
ax1.set_xticks(depth_counts.index)
ax1.grid(True, alpha=0.3, axis='y')
for i, v in enumerate(depth_counts.values):
    ax1.text(depth_counts.index[i], v + 2, str(v), ha='center', fontweight='bold')

# Citations by depth (boxplot)
depth_citations = [nodes_df[nodes_df['min_depth'] == d]['citation_total'].dropna().values 
                   for d in sorted(nodes_df['min_depth'].unique())]
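# Set tick labels separately; the `labels=` keyword is deprecated in newer Matplotlib releases
bp = ax2.boxplot(depth_citations, patch_artist=True)
ax2.set_xticklabels(sorted(nodes_df['min_depth'].unique()))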
for patch in bp['boxes']:
    patch.set_facecolor('lightgreen')
    patch.set_alpha(0.7)
ax2.set_xlabel('Network Depth', fontsize=12)
ax2.set_ylabel('Citation Count', fontsize=12)
ax2.set_title('Citation Distribution by Depth', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_yscale('log')

# Coauthor count by depth
depth_coauthors = nodes_df.groupby('min_depth')['coauthor_count'].mean()
ax3.bar(depth_coauthors.index, depth_coauthors.values, edgecolor='black', alpha=0.7, color='mediumpurple')
ax3.set_xlabel('Network Depth', fontsize=12)
ax3.set_ylabel('Average Coauthor Count', fontsize=12)
ax3.set_title('Average Coauthors by Depth', fontsize=14, fontweight='bold')
ax3.set_xticks(depth_coauthors.index)
ax3.grid(True, alpha=0.3, axis='y')
for i, v in enumerate(depth_coauthors.values):
    ax3.text(depth_coauthors.index[i], v + 0.1, f'{v:.1f}', ha='center', fontweight='bold')

# Article count by depth
depth_articles = nodes_df.groupby('min_depth')['article_count'].mean()
ax4.bar(depth_articles.index, depth_articles.values, edgecolor='black', alpha=0.7, color='salmon')
ax4.set_xlabel('Network Depth', fontsize=12)
ax4.set_ylabel('Average Article Count', fontsize=12)
ax4.set_title('Average Articles by Depth', fontsize=14, fontweight='bold')
ax4.set_xticks(depth_articles.index)
ax4.grid(True, alpha=0.3, axis='y')
for i, v in enumerate(depth_articles.values):
    ax4.text(depth_articles.index[i], v + 0.2, f'{v:.1f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('coauthor_depth_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_depth_analysis.png")


### 9.4 Top Institutions


In [None]:
# Top 15 institutions
top_institutions = institution_counter.most_common(15)
inst_names = [inst for inst, _ in top_institutions]
inst_counts = [count for _, count in top_institutions]

fig, ax = plt.subplots(figsize=(12, 8))
y_pos = np.arange(len(inst_names))
bars = ax.barh(y_pos, inst_counts, color='darkblue', alpha=0.7, edgecolor='black')

# Color the top 3 differently (zip stops early if fewer than 3 bars exist)
for bar, color in zip(bars, ['gold', 'silver', '#CD7F32']):  # gold, silver, bronze
    bar.set_color(color)

ax.set_yticks(y_pos)
ax.set_yticklabels(inst_names)
ax.invert_yaxis()
ax.set_xlabel('Number of Authors', fontsize=12)
ax.set_title('Top 15 Institutions in Coauthor Network', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Add value labels
for i, v in enumerate(inst_counts):
    ax.text(v + 0.5, i, str(v), va='center', fontweight='bold')

plt.tight_layout()
plt.savefig('coauthor_top_institutions.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_top_institutions.png")


### 9.5 Collaboration Strength


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Distribution of papers per collaboration
papers_per_collab = edges_df['num_coauthored_papers']
ax1.hist(papers_per_collab, bins=range(1, int(papers_per_collab.max()) + 2), 
         edgecolor='black', alpha=0.7, color='teal')
ax1.set_xlabel('Number of Coauthored Papers', fontsize=12)
ax1.set_ylabel('Number of Collaborations', fontsize=12)
ax1.set_title('Distribution of Collaboration Strength', fontsize=14, fontweight='bold')
ax1.axvline(papers_per_collab.mean(), color='red', linestyle='--', 
            label=f'Mean: {papers_per_collab.mean():.1f}')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Cumulative distribution
papers_sorted = np.sort(papers_per_collab)
cumulative = np.arange(1, len(papers_sorted) + 1) / len(papers_sorted)
ax2.plot(papers_sorted, cumulative, linewidth=2, color='darkgreen')
ax2.set_xlabel('Number of Coauthored Papers', fontsize=12)
ax2.set_ylabel('Cumulative Proportion', fontsize=12)
ax2.set_title('Cumulative Distribution of Collaboration Strength', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.axhline(0.5, color='red', linestyle='--', alpha=0.5, label='Median')
ax2.legend()

plt.tight_layout()
plt.savefig('coauthor_collaboration_strength.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_collaboration_strength.png")


### 9.6 Correlation Analysis


In [None]:
# Compute correlations
degree_dict = dict(G.degree())
nodes_df['degree'] = nodes_df['id'].map(degree_dict)

# Select numeric columns for correlation
corr_cols = ['citation_total', 'coauthor_count', 'article_count', 'degree', 'min_depth']
corr_data = nodes_df[corr_cols].dropna()
correlation_matrix = corr_data.corr()

# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation Matrix of Network Metrics', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('coauthor_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure saved as: coauthor_correlation_matrix.png")
print("\nCorrelation Matrix:")
print(correlation_matrix)
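
`DataFrame.corr()` computes Pearson correlations by default, and the preceding `dropna()` keeps only authors with a value in every selected column. The sketch below reproduces both steps for a single pair of variables in plain Python. The function name and the toy values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    # Mirror dropna(): discard any pair with a missing value.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy values: degree and citations move together; one citation value is missing.
toy_degree = [1, 2, 3, 4, 5]
toy_citations = [10, 20, 30, 40, None]
r = pearson(toy_degree, toy_citations)
print(round(r, 6))
```

Because rows with any missing value are dropped, the matrix describes only the authors with complete data, which may differ from the full network when citation coverage is low.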


## 10. Community Detection


In [None]:
from networkx.algorithms import community

print("=" * 60)
print("COMMUNITY DETECTION")
print("=" * 60)

# Detect communities using Louvain method
communities = community.louvain_communities(G, seed=42)

print(f"\nNumber of communities detected: {len(communities)}")
print(f"\nCommunity sizes:")
community_sizes = sorted([len(c) for c in communities], reverse=True)
for i, size in enumerate(community_sizes, 1):
    print(f"  Community {i}: {size} members")

# Analyze largest communities
print(f"\nTop 5 Communities by Size:")
communities_sorted = sorted(communities, key=len, reverse=True)
for i, comm in enumerate(communities_sorted[:5], 1):
    print(f"\n  === Community {i} ({len(comm)} members) ===")
    # Get top members by citations
    comm_nodes = nodes_df[nodes_df['id'].isin(comm)]
    top_members = comm_nodes.nlargest(5, 'citation_total')[['name', 'citation_total', 'affiliations']]
    print("  Top members:")
    for idx, row in top_members.iterrows():
        print(f"    - {row['name']} ({row['citation_total']:,.0f} citations)")
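        # NaN is truthy, so check the type before slicing the string
        if isinstance(row['affiliations'], str) and row['affiliations']: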
            print(f"      {row['affiliations'][:60]}{'...' if len(row['affiliations']) > 60 else ''}")
    
    # Common institutions in this community
    comm_institutions = Counter()
    for affiliations in comm_nodes['affiliations']:
        comm_institutions.update(extract_institutions(affiliations))
    if comm_institutions:
        print("  Top institutions:")
        for inst, count in comm_institutions.most_common(3):
            print(f"    - {inst}: {count}")

# Modularity score
modularity = community.modularity(G, communities)
print(f"\nNetwork modularity: {modularity:.4f}")
print("(Higher modularity indicates stronger community structure, max = 1.0)")
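
For an unweighted, undirected graph with m edges, modularity can be written as the sum over communities of L_c / m - (d_c / 2m)^2, where L_c is the number of edges inside community c and d_c is the total degree of its nodes. The sketch below evaluates that formula on a toy graph. It ignores edge weights, so it is not expected to reproduce the value printed above when edges carry different weights; the function name is an assumption.

```python
def modularity_unweighted(edge_list, communities):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edge_list)
    community_of = {
        node: index
        for index, members in enumerate(communities)
        for node in members
    }
    internal_edges = [0] * len(communities)  # L_c
    degree_sum = [0] * len(communities)      # d_c
    for u, v in edge_list:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Toy graph: two triangles joined by a single bridge edge (c, d).
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
q_split = modularity_unweighted(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity_unweighted(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
print(round(q_split, 4), round(q_single, 4))
```

Splitting the toy graph into its two triangles scores 5/14, while putting every node in one community scores 0, which illustrates why a higher value indicates a partition with denser within-group links than expected by chance.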


## 11. Export Summary Statistics


In [None]:
# Create summary statistics dictionary
summary_stats = {
    "network_metadata": {
        "seed_author_id": network_data['seed_author_id'],
        "max_depth": network_data['max_depth'],
        "generated_at": network_data['generated_at'],
        "node_count": network_data['node_count'],
        "edge_count": network_data['edge_count']
    },
    "network_properties": {
        "density": float(nx.density(G)),
        "is_connected": nx.is_connected(G),
        "num_components": nx.number_connected_components(G),
        "avg_clustering": float(nx.average_clustering(G_main)),
        "avg_shortest_path": float(nx.average_shortest_path_length(G_main)),
        "diameter": int(nx.diameter(G_main)),
        "radius": int(nx.radius(G_main)),
        "modularity": float(modularity)
    },
    "degree_statistics": {
        "mean": float(np.mean(degrees)),
        "median": float(np.median(degrees)),
        "min": int(np.min(degrees)),
        "max": int(np.max(degrees)),
        "std": float(np.std(degrees))
    },
    "citation_statistics": {
        "total": int(citations_df['citation_total'].sum()),
        "mean": float(citations_df['citation_total'].mean()),
        "median": float(citations_df['citation_total'].median()),
        "min": int(citations_df['citation_total'].min()),
        "max": int(citations_df['citation_total'].max()),
        "std": float(citations_df['citation_total'].std()),
        "authors_with_data": int(len(citations_df))
    },
    "depth_distribution": {
        str(int(k)): int(v) for k, v in depth_counts.items()
    },
    "collaboration_statistics": {
        "mean_papers_per_edge": float(papers_per_collab.mean()),
        "median_papers_per_edge": float(papers_per_collab.median()),
        "max_papers_per_edge": int(papers_per_collab.max())
    },
    "community_statistics": {
        "num_communities": len(communities),
        "largest_community_size": int(max(len(c) for c in communities)),
        "smallest_community_size": int(min(len(c) for c in communities)),
        "mean_community_size": float(np.mean([len(c) for c in communities]))
    },
    "top_institutions": [
        {"institution": inst, "count": count}
        for inst, count in institution_counter.most_common(20)
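    ],
    # Note: the interests fields are filled only if the Section 12 cell has already been run
    # in this session; on a top-to-bottom run they default to 0 / empty.
    "interests_statistics": {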
        "authors_with_interests": int(authors_with_interests) if 'authors_with_interests' in locals() else 0,
        "unique_interests": len(all_interests) if 'all_interests' in locals() else 0,
        "total_interest_mentions": sum(all_interests.values()) if 'all_interests' in locals() else 0,
        "top_interests": [
            {"interest": interest, "count": count}
            for interest, count in (all_interests.most_common(20) if 'all_interests' in locals() else [])
        ]
    }
}

# Save to JSON
with open('coauthor_network_summary_stats.json', 'w', encoding='utf-8') as f:
    json.dump(summary_stats, f, indent=2, ensure_ascii=False)

print("Summary statistics exported to: coauthor_network_summary_stats.json")
print("\nGenerated visualizations:")
print("  1. coauthor_degree_distribution.png")
print("  2. coauthor_citation_distribution.png")
print("  3. coauthor_depth_analysis.png")
print("  4. coauthor_top_institutions.png")
print("  5. coauthor_collaboration_strength.png")
print("  6. coauthor_correlation_matrix.png")
print("  7. coauthor_research_interests.png (if interests data available)")


## 12. Research Interests / Fields of Study Analysis

In [None]:
print("=" * 60)
print("RESEARCH INTERESTS ANALYSIS")
print("=" * 60)

# Extract all interests from authors
all_interests = Counter()
authors_with_interests = 0
interests_by_depth = defaultdict(Counter)

for idx, row in nodes_df.iterrows():
    author_id = row['id']
    if author_id not in graph_data['authors']:
        continue
    
    author_entry = graph_data['authors'][author_id]
    interests = author_entry.get('interests', [])
    
    if interests and isinstance(interests, list) and len(interests) > 0:
        authors_with_interests += 1
        depth = row['min_depth']
        
        for interest in interests:
            if isinstance(interest, dict):
                title = interest.get('title', '').strip()
                if title:
                    all_interests[title] += 1
                    interests_by_depth[depth][title] += 1
            elif isinstance(interest, str):
                title = interest.strip()
                if title:
                    all_interests[title] += 1
                    interests_by_depth[depth][title] += 1

print(f"\nInterests Coverage:")
print(f"  - Authors with interests data: {authors_with_interests} ({authors_with_interests/len(nodes_df)*100:.1f}%)")
print(f"  - Unique research interests: {len(all_interests)}")
print(f"  - Total interest mentions: {sum(all_interests.values())}")
if authors_with_interests > 0:
    print(f"  - Average interests per author (with data): {sum(all_interests.values())/authors_with_interests:.2f}")
else:
    print("  - No interests data")

if all_interests:
    print(f"\nTop 30 Research Interests:")
    for interest, count in all_interests.most_common(30):
        pct = (count / len(nodes_df)) * 100
        print(f"  {interest}: {count} authors ({pct:.1f}%)")

    # Interests by depth
    print(f"\nTop Interests by Network Depth:")
    for depth in sorted(interests_by_depth.keys()):
        print(f"\n  Depth {depth}:")
        for interest, count in interests_by_depth[depth].most_common(10):
            print(f"    {interest}: {count}")
else:
    print("\nNo interests data found. Run the crawler again to collect interests.")
    print("Use: python scripts/crawl_coauthor_graph.py --max-depth 2")
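
Interest entries may arrive either as dictionaries with a `title` key or as plain strings, and the same cleanup is needed in both cases. The sketch below folds that logic into one helper, assuming those two shapes are the only meaningful ones. The name `interest_titles` is hypothetical, and anything else (numbers, `None`, empty titles) is skipped.

```python
def interest_titles(interests):
    """Return cleaned interest titles from a list of dicts and/or strings."""
    if not isinstance(interests, list):
        return []
    titles = []
    for item in interests:
        # Dict entries carry the text under "title"; string entries are the text itself.
        raw = item.get("title", "") if isinstance(item, dict) else item
        if isinstance(raw, str) and raw.strip():
            titles.append(raw.strip())
    return titles


# Hypothetical mixed input, for illustration only.
toy_interests = [{"title": " Labor Economics "}, "Education", {"title": ""}, 42, None]
cleaned = interest_titles(toy_interests)
print(cleaned)
```

With such a helper, counting reduces to `all_interests.update(interest_titles(...))` per author, and the same function can be reused wherever per-author interest lists are needed.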


### 12.1 Research Interest Visualization


In [None]:
if all_interests and len(all_interests) > 0:
    # Top 20 interests bar chart
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
    
    top_interests = all_interests.most_common(20)
    interest_names = [i[0] for i in top_interests]
    interest_counts = [i[1] for i in top_interests]
    
    y_pos = np.arange(len(interest_names))
    bars = ax1.barh(y_pos, interest_counts, color='steelblue', alpha=0.7, edgecolor='black')
    
    # Color top 3 differently
    if len(bars) >= 1:
        bars[0].set_color('gold')
    if len(bars) >= 2:
        bars[1].set_color('silver')
    if len(bars) >= 3:
        bars[2].set_color('#CD7F32')  # bronze
    
    ax1.set_yticks(y_pos)
    ax1.set_yticklabels(interest_names, fontsize=9)
    ax1.invert_yaxis()
    ax1.set_xlabel('Number of Authors', fontsize=12)
    ax1.set_title('Top 20 Research Interests in Network', fontsize=14, fontweight='bold')
    ax1.grid(True, alpha=0.3, axis='x')
    
    # Add value labels
    for i, v in enumerate(interest_counts):
        ax1.text(v + 0.5, i, str(v), va='center', fontweight='bold', fontsize=9)
    
    # Interests per author distribution
    interests_per_author = []
    for idx, row in nodes_df.iterrows():
        author_id = row['id']
        if author_id in graph_data['authors']:
            author_entry = graph_data['authors'][author_id]
            interests = author_entry.get('interests', [])
            if interests and isinstance(interests, list):
                interests_per_author.append(len(interests))
    
    if interests_per_author:
        ax2.hist(interests_per_author, bins=range(0, max(interests_per_author) + 2), 
                 edgecolor='black', alpha=0.7, color='coral')
        ax2.set_xlabel('Number of Interests per Author', fontsize=12)
        ax2.set_ylabel('Number of Authors', fontsize=12)
        ax2.set_title('Distribution of Interests per Author', fontsize=14, fontweight='bold')
        ax2.axvline(np.mean(interests_per_author), color='red', linestyle='--', 
                    label=f'Mean: {np.mean(interests_per_author):.1f}')
        ax2.legend()
        ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig('coauthor_research_interests.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("Figure saved as: coauthor_research_interests.png")
else:
    print("No interests data available to visualize.")


### 12.2 Interest Co-occurrence Analysis
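
No code accompanies this heading in the notebook as exported. One plausible reading of "co-occurrence" is counting how often two interests appear on the same author profile. The sketch below does that, assuming each author's interests have already been reduced to a list of title strings. The function name and the toy lists are hypothetical, and no results from the real data are implied.

```python
from collections import Counter
from itertools import combinations


def interest_cooccurrence(interest_lists):
    """Count, for every unordered pair of interests, the authors listing both."""
    pair_counts = Counter()
    for titles in interest_lists:
        # Deduplicate and sort so each pair has one canonical (a, b) ordering.
        unique_titles = sorted(set(titles))
        pair_counts.update(combinations(unique_titles, 2))
    return pair_counts


# Hypothetical per-author interest lists, for illustration only.
toy_lists = [
    ["Education", "Labor Economics"],
    ["Labor Economics", "Education", "Development"],
    ["Development"],
]
pairs = interest_cooccurrence(toy_lists)
for (first, second), count in pairs.most_common():
    print(f"{first} + {second}: {count}")
```

Sorting each author's titles before pairing means ("Education", "Labor Economics") and the reversed order are counted as the same pair.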


## Summary

This notebook analyzes the coauthor network, covering:

1. **Network-level statistics**: Overall structure, connectivity, clustering, and path lengths
2. **Author-level statistics**: Citations, productivity, and metadata completeness
3. **Top authors**: By various metrics (citations, connections, centrality)
4. **Institutional analysis**: Distribution across universities and research centers
5. **Collaboration patterns**: Strength and distribution of coauthorship links
6. **Visualizations**: 7 comprehensive figures showing various aspects of the network
7. **Community detection**: Identification of research communities within the network
8. **Correlation analysis**: Relationships between different network metrics
9. **Research interests analysis**: Fields of study distribution and co-occurrence patterns

All summary statistics have been exported to `coauthor_network_summary_stats.json` for further use.

### Notes on Research Interests

The research interests (fields of study) data is collected from Google Scholar author profiles. If you see no interests data:
- The data may not have been collected yet (requires re-running the crawler)
- Run: `python scripts/crawl_coauthor_graph.py --max-depth 2` to collect interests
- Then regenerate visualization data: `python scripts/generate_coauthor_viz_data.py`
